Google app script search involving special characters Unterminated parenthetical Error [duplicate] - regex

I just want to create a regular expression out of any possible string.
var usersString = "Hello?!*`~World()[]";
var expression = new RegExp(RegExp.escape(usersString))
var matches = "Hello".match(expression);
Is there a built-in method for that? If not, what do people use? Ruby has Regexp.escape. I don't feel like I should need to write my own; there has got to be something standard out there.

The function linked in another answer is insufficient. It fails to escape ^ or $ (start and end of string), or -, which in a character group is used for ranges.
Use this function:
function escapeRegex(string) {
return string.replace(/[/\-\\^$*+?.()|[\]{}]/g, '\\$&');
}
While it may seem unnecessary at first glance, escaping - (as well as ^) makes the function suitable for escaping characters to be inserted into a character class as well as the body of the regex.
Escaping / makes the function suitable for escaping characters to be used in a JavaScript regex literal for later evaluation.
As there is no downside to escaping either of them, it makes sense to escape to cover wider use cases.
And yes, it is a disappointing failing that this is not part of standard JavaScript.
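To illustrate the character-class case, here is a minimal sketch. The `anyOf` helper and the sample strings are hypothetical, made up for this example; only `escapeRegex` comes from the answer above.

```javascript
// Same escaping function as above, repeated so the sketch is self-contained.
function escapeRegex(string) {
  return string.replace(/[/\-\\^$*+?.()|[\]{}]/g, '\\$&');
}

// Hypothetical helper: build a regex that matches any single character
// from a user-supplied set by placing the escaped text inside [...].
function anyOf(chars) {
  return new RegExp('[' + escapeRegex(chars) + ']', 'g');
}

const unsafe = 'a-c^]';            // contains a range marker, a caret and a closing bracket
const re = anyOf(unsafe);          // source is [a\-c\^\]]
const hits = 'abc-^]'.match(re);
console.log(hits);                 // [ 'a', 'c', '-', '^', ']' ]  ('b' is not matched)
```

Without the escaping, the same input would produce `[a-c^]]`, which is the range a to c plus ^, followed by a literal ].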

For anyone using Lodash, since v3.0.0 a _.escapeRegExp function is built-in:
_.escapeRegExp('[lodash](https://lodash.com/)');
// → '\[lodash\]\(https:\/\/lodash\.com\/\)'
And, in the event that you don't want to require the full Lodash library, you may require just that function!

Most of the expressions here solve single specific use cases.
That's okay, but I prefer an "always works" approach.
function regExpEscape(literal_string) {
return literal_string.replace(/[-[\]{}()*+!<=:?.\/\\^$|#\s,]/g, '\\$&');
}
This will "fully escape" a literal string for any of the following uses in regular expressions:
Insertion in a regular expression. E.g. new RegExp(regExpEscape(str))
Insertion in a character class. E.g. new RegExp('[' + regExpEscape(str) + ']')
Insertion in integer count specifier. E.g. new RegExp('x{1,' + regExpEscape(str) + '}')
Execution in non-JavaScript regular expression engines.
Special Characters Covered:
-: Creates a character range in a character class.
[ / ]: Starts / ends a character class.
{ / }: Starts / ends a numeration specifier.
( / ): Starts / ends a group.
* / + / ?: Specifies repetition type.
.: Matches any character.
\: Escapes characters, and starts entities.
^: Specifies start of matching zone, and negates matching in a character class.
$: Specifies end of matching zone.
|: Specifies alternation.
#: Specifies comment in free spacing mode.
\s: Ignored in free spacing mode.
,: Separates values in numeration specifier.
/: Starts or ends expression.
:: Completes special group types, and part of Perl-style character classes.
!: Negates zero-width group.
< / =: Part of zero-width group specifications.
Notes:
/ is not strictly necessary in any flavor of regular expression. However, it protects in case someone (shudder) does eval("/" + pattern + "/");.
, ensures that if the string is meant to be an integer in the numerical specifier, it will properly cause a RegExp compiling error instead of silently compiling wrong.
#, and \s do not need to be escaped in JavaScript, but do in many other flavors. They are escaped here in case the regular expression will later be passed to another program.
If you also need to future-proof the regular expression against potential additions to the JavaScript regex engine capabilities, I recommend using the more paranoid:
function regExpEscapeFuture(literal_string) {
return literal_string.replace(/[^A-Za-z0-9_]/g, '\\$&');
}
This function escapes every character except those explicitly guaranteed not to be used for syntax in future regular expression flavors.
For the truly sanitation-keen, consider this edge case:
var s = '';
new RegExp('(choice1|choice2|' + regExpEscape(s) + ')');
This should compile fine in JavaScript, but will not in some other flavors. If intending to pass to another flavor, the null case of s === '' should be independently checked, like so:
var s = '';
new RegExp('(choice1|choice2' + (s ? '|' + regExpEscape(s) : '') + ')');
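The same empty-string check generalizes to a list of user strings. The following is only a sketch: `alternationOf` and its arguments are hypothetical names introduced for this example, and the anchors are added just to make the test results easy to read.

```javascript
// Escaping function from this answer, repeated so the sketch is self-contained.
function regExpEscape(literal_string) {
  return literal_string.replace(/[-[\]{}()*+!<=:?.\/\\^$|#\s,]/g, '\\$&');
}

// Hypothetical helper: combine fixed alternatives (already valid regex source)
// with user-supplied literals, dropping empty strings so that no empty
// alternative such as "(a|b|)" is ever emitted.
function alternationOf(fixed, userStrings) {
  const extra = userStrings.filter(s => s !== '').map(regExpEscape);
  return new RegExp('^(' + fixed.concat(extra).join('|') + ')$');
}

const re = alternationOf(['choice1', 'choice2'], ['', 'a|b', 'c.d']);
console.log(re.source);      // ^(choice1|choice2|a\|b|c\.d)$
console.log(re.test('a|b')); // true
console.log(re.test('a'));   // false
console.log(re.test('cxd')); // false
```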

Mozilla Developer Network's Guide to Regular Expressions provides this escaping function:
function escapeRegExp(string) {
return string.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'); // $& means the whole matched string
}
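Applied to the string from the question, usage might look like the sketch below. The haystack strings are made up for illustration.

```javascript
// MDN's escaping function, repeated so the sketch is self-contained.
function escapeRegExp(string) {
  return string.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

const usersString = 'Hello?!*`~World()[]';
const expression = new RegExp(escapeRegExp(usersString));

// The pattern now matches the user's text literally and nothing else.
const literalMatch = expression.test('say Hello?!*`~World()[] now');
const prefixMatch = expression.test('Hello');
console.log(literalMatch); // true
console.log(prefixMatch);  // false
```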

In jQuery UI's autocomplete widget (version 1.9.1) they use a slightly different regular expression (line 6753). Here's that regular expression combined with bobince's approach.
RegExp.escape = function( value ) {
return value.replace(/[\-\[\]{}()*+?.,\\\^$|#\s]/g, "\\$&");
}

There is an ES7 proposal for RegExp.escape at https://github.com/benjamingr/RexExp.escape/, with a polyfill available at https://github.com/ljharb/regexp.escape.

Nothing should prevent you from just escaping every non-alphanumeric character:
usersString.replace(/(?=\W)/g, '\\');
You lose a certain degree of readability when doing re.toString() but you win a great deal of simplicity (and security).
According to ECMA-262, regular expression "syntax characters" are always non-alphanumeric, so the result is safe, and special escape sequences (\d, \w, \n) are always alphanumeric, so no false control escapes will be produced.
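A short sketch of what this produces, with a made-up input string. It also shows one caveat worth knowing: in Unicode mode (the u flag) only syntax characters and / may be backslash-escaped, so a pattern escaped this way is rejected there.

```javascript
const usersString = 'a.b-c d';

// Insert a backslash before every non-word character.
const escaped = usersString.replace(/(?=\W)/g, '\\');
console.log(escaped); // a\.b\-c\ d

// Without the u flag, the result matches the original text literally.
const re = new RegExp(escaped);
const literalOnly = re.test('a.b-c d') && !re.test('axb-c d');
console.log(literalOnly); // true

// With the u flag, "\-" and "\ " are not valid escapes, so construction throws.
let unicodeModeError = null;
try {
  new RegExp(escaped, 'u');
} catch (err) {
  unicodeModeError = err;
}
console.log(unicodeModeError instanceof SyntaxError); // true
```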

Here is an example based on the rejected ES proposal. It checks whether the property already exists, in case TC39 backtracks on its decision.
Code:
if (!Object.prototype.hasOwnProperty.call(RegExp, 'escape')) {
RegExp.escape = function(string) {
// https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions#Escaping
// https://github.com/benjamingr/RegExp.escape/issues/37
return string.replace(/[.*+\-?^${}()|[\]\\]/g, '\\$&'); // $& means the whole matched string
};
}
Code Minified:
Object.prototype.hasOwnProperty.call(RegExp,"escape")||(RegExp.escape=function(e){return e.replace(/[.*+\-?^${}()|[\]\\]/g,"\\$&")});
// ...
var assert = require('assert');
var str = 'hello. how are you?';
var regex = new RegExp(RegExp.escape(str), 'g');
assert.equal(String(regex), '/hello\\. how are you\\?/g');
There is also an npm module at:
https://www.npmjs.com/package/regexp.escape
One can install this and use it as so:
npm install regexp.escape
or
yarn add regexp.escape
var escape = require('regexp.escape');
var assert = require('assert');
var str = 'hello. how are you?';
var regex = new RegExp(escape(str), 'g');
assert.equal(String(regex), '/hello\\. how are you\\?/g');
In the GitHub && NPM page are descriptions of how to use the shim/polyfill for this option, as well. That logic is based on return RegExp.escape || implementation;, where implementation contains the regexp used above.
The NPM module is an extra dependency, but it also makes it easier for an external contributor to identify logical parts added to the code. ¯\(ツ)/¯

Another (much safer) approach is to escape all the characters (and not just a few special ones that we currently know) using the unicode escape format \u{code}:
function escapeRegExp(text) {
return Array.from(text)
.map(char => `\\u{${char.codePointAt(0).toString(16)}}`) // codePointAt handles characters outside the BMP
.join('');
}
console.log(escapeRegExp('a.b')); // '\u{61}\u{2e}\u{62}'
Please note that you need to pass the u flag for this method to work:
var expression = new RegExp(escapeRegExp(usersString), 'u');

This is a shorter version.
RegExp.escape = function(s) {
return s.replace(/[$-\/?[-^{|}]/g, '\\$&');
}
This includes the non-meta characters of %, &, ', and ,, but the JavaScript RegExp specification allows this.
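To see exactly which characters the ranges in this class cover, one can enumerate printable ASCII. This is only a verification sketch; the variable names are made up for the example.

```javascript
// The character class from the short version above, without the g flag
// so that test() carries no lastIndex state between calls.
const shortClass = /[$-\/?[-^{|}]/;

let escapedChars = '';
for (let code = 0x20; code <= 0x7e; code++) {
  const ch = String.fromCharCode(code);
  if (shortClass.test(ch)) escapedChars += ch;
}

console.log(escapedChars); // $%&'()*+,-./?[\]^{|}
```

The output confirms that the range $-/ pulls in %, &, ' and , along with the real metacharacters.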

XRegExp has an escape function:
XRegExp.escape('Escaped? <.>');
// -> 'Escaped\?\ <\.>'
More on: http://xregexp.com/api/#escape

escapeRegExp = function(str) {
if (str == null) return '';
return String(str).replace(/([.*+?^=!:${}()|[\]\/\\])/g, '\\$1');
};

Rather than only escaping characters which will cause issues in your regular expression (e.g.: a blacklist), consider using a whitelist instead. This way each character is considered tainted unless it matches.
For this example, assume the following expression:
RegExp.escape('be || ! be');
This whitelists letters, digits, underscores and whitespace:
RegExp.escape = function (string) {
return string.replace(/([^\w\d\s])/gi, '\\$1');
}
Returns:
"be \|\| \! be"
This may escape characters which do not need to be escaped, but this doesn't hinder your expression (maybe some minor time penalties - but it's worth it for safety).

The functions in the other answers are overkill for escaping entire regular expressions (they may be useful for escaping parts of regular expressions that will later be concatenated into bigger regexps).
If you escape an entire regexp and are done with it, quoting the metacharacters that are either standalone (., ?, +, *, ^, $, |, \) or start something ((, [, {) is all you need:
String.prototype.regexEscape = function regexEscape() {
return this.replace(/[.?+*^$|({[\\]/g, '\\$&');
};
And yes, it's disappointing that JavaScript doesn't have a function like this built-in.

I borrowed bobince's answer above and created a tagged template function for creating a RegExp where part of the value is escaped and part isn't.
regex-escaped.js
RegExp.escape = text => text.replace(/[\-\[\]{}()*+?.,\\\^$|#\s]/g, '\\$&');
RegExp.escaped = flags =>
function (regexStrings, ...escaped) {
const source = regexStrings
.map((s, i) =>
// escaped[i] will be undefined for the last value of s
escaped[i] === undefined
? s
: s + RegExp.escape(escaped[i].toString())
)
.join('');
return new RegExp(source, flags);
};
function capitalizeFirstUserInputCaseInsensitiveMatch(text, userInput) {
const [, before, match, after ] =
RegExp.escaped('i')`^((?:(?!${userInput}).)*)(${userInput})?(.*)$`.exec(text);
return `${before}${match.toUpperCase()}${after}`;
}
const text = 'hello (world)';
const userInput = 'lo (wor';
console.log(capitalizeFirstUserInputCaseInsensitiveMatch(text, userInput));
For our TypeScript fans...
global.d.ts
interface RegExpConstructor {
/** Escapes a string so that it can be used as a literal within a `RegExp`. */
escape(text: string): string;
/**
* Returns a tagged template function that creates `RegExp` with its template values escaped.
*
* This can be useful when using a `RegExp` to search with user input.
*
* @param flags The flags to apply to the `RegExp`.
*
* @example
*
* function capitalizeFirstUserInputCaseInsensitiveMatch(text: string, userInput: string) {
* const [, before, match, after ] =
* RegExp.escaped('i')`^((?:(?!${userInput}).)*)(${userInput})?(.*)$`.exec(text);
*
* return `${before}${match.toUpperCase()}${after}`;
* }
*/
escaped(flags?: string): (regexStrings: TemplateStringsArray, ...escapedVals: Array<string | number>) => RegExp;
}

There have only ever been, and only ever will be, 12 metacharacters that need to be escaped
to be treated as literals.
It doesn't matter what is done with the escaped string afterwards, whether it is inserted into a balanced regex wrapper or appended.
Do a string replace using this
var escaped_string = oldstring.replace(/[\\^$.|?*+()[{]/g, '\\$&');

This one is the permanent solution.
function regExpEscapeFuture(literal_string) {
return literal_string.replace(/[^A-Za-z0-9_]/g, '\\$&');
}

Just published a regex escape gist based on the RegExp.escape shim which was in turn based on the rejected RegExp.escape proposal. Looks roughly equivalent to the accepted answer except it doesn't escape - characters, which seems to be actually fine according to my manual testing.
Current gist at the time of writing this:
const syntaxChars = /[\^$\\.*+?()[\]{}|]/g
/**
* Escapes all special regex characters in a given string
* so that it can be passed to `new RegExp(escaped, ...)` to match all given
* characters literally.
*
* inspired by https://github.com/es-shims/regexp.escape/blob/master/implementation.js
*
* @param {string} s
*/
export function escape(s) {
return s.replace(syntaxChars, '\\$&')
}

Related

How do I do regex substitutions with multiple capture groups?

I'm trying to allow users to filter strings of text using a glob pattern whose only control character is *. Under the hood, I figured the easiest way to filter the list of strings would be to use Js.Re.test (https://rescript-lang.org/docs/manual/latest/api/js/re#test_), and it is (easy).
Ignoring the * on the user filter string for now, what I'm having difficulty with is escaping all the RegEx control characters. Specifically, I don't know how to replace the capture groups within the input text to create a new string.
So far, I've got this, but it's not quite right:
let input = "test^ing?123[foo";
let escapeRegExCtrl = searchStr => {
let re = [%re("/([\\^\\[\\]\\.\\|\\\\\\?\\{\\}\\+][^\\^\\[\\]\\.\\|\\\\\\?\\{\\}\\+]*)/g")];
let break = ref(false);
while (!break.contents) {
switch (Js.Re.exec_ (re, searchStr)) {
| Some(result) => {
let match = Js.Re.captures(result)[0];
Js.log2("Matching: ", match)
}
| None => {
break := true;
}
}
}
};
input->escapeRegExCtrl;
If I disregard the "test" portion of the string being skipped, the above output will produce:
Matching: ^ing
Matching: ?123
Matching: [foo
With the above example, at the end of the day, what I'm trying to produce is this (with leading and trailing .*):
.*test\^ing\?123\[foo.*
But I'm unsure how to achieve creating a contiguous string from the matched capture groups.
(echo "test^ing?123[foo" | sed -r 's_([\^\?\[])_\\\1_g' would get the work done on the command line)
EDIT
Based on Chris Maurer's answer, there is a method in the JS library that does what I was looking for. A little digging exposed the ReasonML proxy for that method:
https://rescript-lang.org/docs/manual/latest/api/js/string#replacebyre
Let me see if I have this right; you want to implement a character matcher where everything is literal except *. Presumably the * is supposed to work like that in Windows dir commands, matching zero or more characters.
Furthermore, you want to implement it by passing a user-entered character string directly to a Regexp match function after suitably sanitizing it to only deal with the *.
If I have this right, then it sounds like you need to do two things to get the string ready for Js.Re.test:
Quote all the special regex characters, and
Turn all instances of * into .* or maybe .*?
Let's keep this simple and process the string in two steps, each one using a regex replace. The special regex characters to quote in the first step are [ \ ^ $ . | ? + ( ) { } (the * is deliberately left out, because the second step handles it). Suitably quoting these for replace:
str.replace(/[\[\\\^\$\.\|\?\+\(\)\{\}]/g, '\\$&')
This is just all those special characters quoted. The $& in the replacement string inserts whatever matched, and the doubled backslash puts a literal \ in front of it.
Then pass that result to a second replace for the * to .*? transformation.
str.replace(/\*+/g, '.*?')
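Put together, the two steps could look like the sketch below in plain JavaScript. `globToRegExp` is a hypothetical helper name, and the sketch uses a greedy .* for the wildcard, which gives the same yes/no answer as .*? when the regex is only used for testing.

```javascript
// Hypothetical helper: turn a glob whose only wildcard is * into a RegExp.
function globToRegExp(glob) {
  // Step 1: quote every regex metacharacter except *.
  const escaped = glob.replace(/[\\^$.|?+()[\]{}]/g, '\\$&');
  // Step 2: turn each run of * into "match anything".
  const pattern = escaped.replace(/\*+/g, '.*');
  return new RegExp(pattern);
}

const literal = globToRegExp('test^ing?123[foo');
console.log(literal.source);                   // test\^ing\?123\[foo
console.log(literal.test('test^ing?123[foo')); // true
console.log(literal.test('testing123foo'));    // false

const wildcard = globToRegExp('te*[foo');
console.log(wildcard.source);                   // te.*\[foo
console.log(wildcard.test('test^ing?123[foo')); // true
```

Because the regex is not anchored, test() already behaves as if the pattern had a leading and trailing .*.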

Adding custom header based on the ngx.re.match()

I'm trying to add custom header based on the uri value, in this case for all the pdf files:
header_filter_by_lua_block {
local m, err = ngx.re.match(ngx.var.uri, "%.pdf$", "io")
if m then
ngx.log(ngx.ERR, "found match: ", m[0])
ngx.header["X-Custom-Header"] = "ZZzz"
end
}
I'm using lua-nginx-module in this task, therefore I expected that standard lua regex syntax should apply, thus %. should match . (dot), however it doesn't seem to work. What's the problem?
If I change regex from %.pdf$ to .pdf$ then it does work, but obviously it matches not just blabla.pdf but also blablapdf.
lua-nginx-module uses PCRE (Perl compatible regular expression), so \ should be used instead of % to escape special characters. Backslash is also Lua string escape symbol, so double escape is needed:
ngx.re.match(ngx.var.uri, "\\.pdf$", "io")
Alternatively, you can use bracket string literals instead of quotes to avoid double escape:
ngx.re.match(ngx.var.uri, [[\.pdf$]], "io")

Escaping Asterisk Grabs wrong character

When I run the code below I get the unexpected result where \* also captures É. Is there a way to make it only capture * like I wanted?
let s =
"* A
ÉR
* B"
let result = System.Text.RegularExpressions.Regex.Replace(s, "\n(?!\*)", "", Text.RegularExpressions.RegexOptions.Multiline)
printfn "%s" result
Result After Running Replace
* AÉR
* B
Expected Result
"* A
ÉR
* B"
UPDATE
This seems to be working, when I use a pattern like so \n(?=\*). I guess I needed a positive lookahead.
You're right that you need to use positive lookahead instead of negative lookahead to get the result you want. However, to clarify an issue that came up in the comments, in F# a string delimited by just "" is not quite like either a plain C# string delimited by "" or a C# string delimited by @"" - if you want the latter you should also use @"" in F#. The difference is that in a normal F# string, backslashes will be treated as escape sequences only when used in front of a valid character to escape (see the table towards the top of Strings (F#)). Otherwise, it is treated as a literal backslash character. So, since '*' is not a valid character to escape, you luckily see the behavior you expected (in C#, by contrast, this would be a syntax error because it's an unrecognized escape). I would recommend that you not rely on this and should use a verbatim @"" string instead.
In other words, in F# the following three strings are all equivalent:
let s1 = "\n\*"
let s2 = "\n\\*"
let s3 = @"
\*"
I think that the C# design is more sensible because it prevents confusion on what exactly is being escaped.

Playframework with Deadbolt 2: Pattern regular expression not match

I am using Deadbolt2 with play-framework 2.3.x. When I try to access a controller that declares Deadbolt Patterns using regular expressions, I get a Not-found error. According to this sample, it is possible to use regular expressions with Pattern in our application, but when I declare a regular expression, I am not able to use it. My code looks like this:
def pattern_one = Pattern("CH{4,}", PatternType.REGEX, new MyDeadboltHandler) {} // NOT ACCESSED
def pattern_one = Pattern("CH*", PatternType.REGEX, new MyDeadboltHandler) { // NOT ACCESSED
def pattern_one = Pattern("CHANNEL", PatternType.REGEX, new MyDeadboltHandler) { // ACCESSED SUCCESSFULLY
Regular expressions are not wildcards. If a * wildcard matches anything any number of times, in regex, you need to use .*, where . means any character but a newline, and * means 0 or more times.
Moreover, if you want to match a string that contains a word starting with CH, you can use a word boundary, \\b: \\bCH.*.
If you want to specify that the string must start with CH and match the whole string, you can use ^CH.*.
You need to use CH.* or CH.{4,} if you want something (not just Hs) after CH. The . means any character, just like in any other regular expression.
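The difference can be checked in any regex engine. Below is a small sketch in JavaScript rather than Scala, with ^ and $ added on the assumption that the pattern has to match the whole value; how Deadbolt itself applies the pattern is not shown here.

```javascript
const value = 'CHANNEL';

// Each key is a pattern from the question or this answer;
// each result says whether it matches the whole value.
const results = {
  'CH{4,}': /^CH{4,}$/.test(value),   // C followed by four or more H
  'CH*': /^CH*$/.test(value),         // C followed by zero or more H
  'CH.{4,}': /^CH.{4,}$/.test(value), // CH followed by four or more of any character
  'CH.*': /^CH.*$/.test(value),       // CH followed by anything
  'CHANNEL': /^CHANNEL$/.test(value), // the literal word
};

console.log(results);
// { 'CH{4,}': false, 'CH*': false, 'CH.{4,}': true, 'CH.*': true, CHANNEL: true }
```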

Regex to capture VBA comments

I'm trying to capture VBA comments. I have the following so far
'[^";]+\Z
This captures anything that starts with a single quote and does not contain any double quotes up to the end of the string, i.e. it will not match single quotes inside a double-quoted string.
dim s as string ' a string variable -- works
s = "the cat's hat" ' quote within string -- works
But fails if the comment contains a double quote string
i.e.
dim s as string ' string should be set to "ten"
How can I fix my regex to handle that too?
The pattern in @Jeff Wurz's comment (^\'[^\r\n]+$|''[^\r\n]+$) doesn't even match any of your test samples, and the linked question is useless: the regex in there will only match that specific comment in the OP's question, not "the VBA comment syntax".
The regex you have come up with works even better than what I had when I gave up the regex approach.
Well done!
The problem is that you can't parse VBA comments with a regex.
In Lexers vs Parsers, @SasQ's answer does a good job at explaining Chomsky's grammar levels:
Level 3: Regular grammars
They use regular expressions, that is, they can consist only of the
symbols of alphabet (a,b), their concatenations (ab,aba,bbb etc.), or
alternatives (e.g. a|b). They can be implemented as finite state
automata (FSA), like NFA (Nondeterministic Finite Automaton) or better
DFA (Deterministic Finite Automaton). Regular grammars can't handle
with nested syntax, e.g. properly nested/matched parentheses
(()()(()())), nested HTML/BBcode tags, nested blocks etc. It's because
state automata to deal with it should have to have infinitely many
states to handle infinitely many nesting levels.
Level 2: Context-free grammars
They can have nested, recursive, self-similar branches in their syntax
trees, so they can handle with nested structures well. They can be
implemented as state automaton with stack. This stack is used to
represent the nesting level of the syntax. In practice, they're
usually implemented as a top-down, recursive-descent parser which uses
machine's procedure call stack to track the nesting level, and use
recursively called procedures/functions for every non-terminal symbol
in their syntax. But they can't handle with a context-sensitive
syntax. E.g. when you have an expression x+3 and in one context this x
could be a name of a variable, and in other context it could be a name
of a function etc.
Level 1: Context-sensitive grammars
Regular Expressions simply aren't the appropriate tool for solving this problem, because whenever there's more than a single quote (/apostrophe), or when double quotes are involved, you need to figure out whether the left-most apostrophe in the code line is inside double quotes, and if it is, then you need to match the double quotes and find the left-most apostrophe after the closing double quote - actually, the left-most apostrophe that isn't part of a string literal, is your comment marker.
My understanding is that VBA comment syntax is a context-sensitive grammar (level 1), because the apostrophe is only your marker if it's not part of a string literal, and to figure out whether an apostrophe is part of a string literal, the easiest is probably to walk your string left to right and to toggle some IsInsideQuote flag as you encounter double-quotes... but only if they're not escaped (doubled-up). Actually you don't even check to see if there's an apostrophe inside the string literal: you just keep walking until open quotes are closed, and only when the "in-quotes flag" is False have you found a comment marker if you encounter a single quote.
Good luck!
Here's a test case you're missing:
s = "abc'def ""xyz""'nutz!" 'string with apostrophes and escaped double quotes
If you don't care about capturing the string literals, you can simply ignore the escaped double quotes and see 3 string literals here: "abc'def ", "xyz" and "'nutz!".
This C# code outputs 'string with apostrophes and escaped double quotes (all in-string double quotes are escaped with a backslash in the code), and works with all the test strings I gave it:
static void Main(string[] args)
{
var instruction = "s = \"abc'def \"\"xyz\"\"'nutz!\" 'string with apostrophes and escaped double quotes";
// var instruction = "s = \"the cat's hat\" ' quote within string -- works";
// var instruction = "dim s as string ' string should be set to \"ten\"";
int? commentStart = null;
var isInsideQuotes = false;
for (var i = 0; i < instruction.Length; i++)
{
if (instruction[i] == '"')
{
isInsideQuotes = !isInsideQuotes;
}
if (!isInsideQuotes && instruction[i] == '\'')
{
commentStart = i;
break;
}
}
if (commentStart.HasValue)
{
Console.WriteLine(instruction.Substring(commentStart.Value));
}
Console.ReadLine();
}
Then if you want to capture all legal comments, you need to handle the legacy Rem keyword, and consider line continuations:
Rem this is a legal comment
' this _
is also _
a legal comment
In other words, \r\n in itself isn't enough to correctly identify all end-of-statement tokens.
A proper lexer+parser seems the only way to capture all comments.