Adding custom header based on the ngx.re.match() - regex

I'm trying to add custom header based on the uri value, in this case for all the pdf files:
header_filter_by_lua_block {
local m, err = ngx.re.match(ngx.var.uri, "%.pdf$", "io")
if m then
ngx.log(ngx.ERR, "found match: ", m[0])
ngx.header["X-Custom-Header"] = "ZZzz"
end
}
I'm using lua-nginx-module in this task, therefore I expected that standard lua regex syntax should apply, thus %. should match . (dot), however it doesn't seem to work. What's the problem?
If I change regex from %.pdf$ to .pdf$ then it does work, but obviously it matches not just blabla.pdf but also blablapdf.

lua-nginx-module uses PCRE (Perl compatible regular expression), so \ should be used instead of % to escape special characters. Backslash is also Lua string escape symbol, so double escape is needed:
ngx.re.match(ngx.var.uri, "\\.pdf$", "io")
Alternatively, you can use bracket string literals instead of quotes to avoid double escape:
ngx.re.match(ngx.var.uri, [[\.pdf$]], "io")

Related

Replace every " with \" in Lua

X-Problem: I want to dump an entire lua-script to a single string-line, which can be compiled into a C-Program afterwards.
Y-Problem: How can you replace every " with \" ?
I think it makes sense to try something like this
data = string.gsub(line, "c", "\c")
where c is the "-character. But this does not work of course.
You need to escape both quotes and backslashes, if I understand your Y problem:
data = string.gsub(line, "\"", "\\\"")
or use the other single quotes (still escape the backslash):
data = string.gsub(line, '"', '\\"')
A solution to your X-Problem is to safely escape any sequence that could interfere with the interpreter.
Lua has the %q option for string.format that will format and escape the provided string in such a way, that it can be safely read back by Lua. It should be also true for your C interpreter.
Example string: This \string's truly"tricky
If you just enclosed it in either single or double-quotes, there'd still be a quote that ended the string early. Also there's the invalid escape sequence \s.
Imagine this string was already properly handled in Lua, so we'll just pass it as a parameter:
string.format("%q", 'This \\string\'s truly"tricky')
returns (notice, I used single-quotes in code input):
"This \\string's truly\"tricky"
Now that's a completely valid Lua string that can be written and read from a file. No need to manually escape every special character and risk implementation mistakes.
To correctly implement your Y approach, to escape (invalid) characters with \, use proper pattern matching to replace the captured string with a prefix+captured string:
string.gsub('he"ll"o', "[\"']", "\\%1") -- will prepend backslash to any quote

Google app script search involving special characters Unterminated parenthetical Error [duplicate]

I just want to create a regular expression out of any possible string.
var usersString = "Hello?!*`~World()[]";
var expression = new RegExp(RegExp.escape(usersString))
var matches = "Hello".match(expression);
Is there a built-in method for that? If not, what do people use? Ruby has RegExp.escape. I don't feel like I'd need to write my own, there have got to be something standard out there.
The function linked in another answer is insufficient. It fails to escape ^ or $ (start and end of string), or -, which in a character group is used for ranges.
Use this function:
function escapeRegex(string) {
return string.replace(/[/\-\\^$*+?.()|[\]{}]/g, '\\$&');
}
While it may seem unnecessary at first glance, escaping - (as well as ^) makes the function suitable for escaping characters to be inserted into a character class as well as the body of the regex.
Escaping / makes the function suitable for escaping characters to be used in a JavaScript regex literal for later evaluation.
As there is no downside to escaping either of them, it makes sense to escape to cover wider use cases.
And yes, it is a disappointing failing that this is not part of standard JavaScript.
For anyone using Lodash, since v3.0.0 a _.escapeRegExp function is built-in:
_.escapeRegExp('[lodash](https://lodash.com/)');
// → '\[lodash\]\(https:\/\/lodash\.com\/\)'
And, in the event that you don't want to require the full Lodash library, you may require just that function!
Most of the expressions here solve single specific use cases.
That's okay, but I prefer an "always works" approach.
function regExpEscape(literal_string) {
return literal_string.replace(/[-[\]{}()*+!<=:?.\/\\^$|#\s,]/g, '\\$&');
}
This will "fully escape" a literal string for any of the following uses in regular expressions:
Insertion in a regular expression. E.g. new RegExp(regExpEscape(str))
Insertion in a character class. E.g. new RegExp('[' + regExpEscape(str) + ']')
Insertion in integer count specifier. E.g. new RegExp('x{1,' + regExpEscape(str) + '}')
Execution in non-JavaScript regular expression engines.
Special Characters Covered:
-: Creates a character range in a character class.
[ / ]: Starts / ends a character class.
{ / }: Starts / ends a numeration specifier.
( / ): Starts / ends a group.
* / + / ?: Specifies repetition type.
.: Matches any character.
\: Escapes characters, and starts entities.
^: Specifies start of matching zone, and negates matching in a character class.
$: Specifies end of matching zone.
|: Specifies alternation.
#: Specifies comment in free spacing mode.
\s: Ignored in free spacing mode.
,: Separates values in numeration specifier.
/: Starts or ends expression.
:: Completes special group types, and part of Perl-style character classes.
!: Negates zero-width group.
< / =: Part of zero-width group specifications.
Notes:
/ is not strictly necessary in any flavor of regular expression. However, it protects in case someone (shudder) does eval("/" + pattern + "/");.
, ensures that if the string is meant to be an integer in the numerical specifier, it will properly cause a RegExp compiling error instead of silently compiling wrong.
#, and \s do not need to be escaped in JavaScript, but do in many other flavors. They are escaped here in case the regular expression will later be passed to another program.
If you also need to future-proof the regular expression against potential additions to the JavaScript regex engine capabilities, I recommend using the more paranoid:
function regExpEscapeFuture(literal_string) {
return literal_string.replace(/[^A-Za-z0-9_]/g, '\\$&');
}
This function escapes every character except those explicitly guaranteed not be used for syntax in future regular expression flavors.
For the truly sanitation-keen, consider this edge case:
var s = '';
new RegExp('(choice1|choice2|' + regExpEscape(s) + ')');
This should compile fine in JavaScript, but will not in some other flavors. If intending to pass to another flavor, the null case of s === '' should be independently checked, like so:
var s = '';
new RegExp('(choice1|choice2' + (s ? '|' + regExpEscape(s) : '') + ')');
Mozilla Developer Network's Guide to Regular Expressions provides this escaping function:
function escapeRegExp(string) {
return string.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'); // $& means the whole matched string
}
In jQuery UI's autocomplete widget (version 1.9.1) they use a slightly different regular expression (line 6753), here's the regular expression combined with bobince's approach.
RegExp.escape = function( value ) {
return value.replace(/[\-\[\]{}()*+?.,\\\^$|#\s]/g, "\\$&");
}
There is an ES7 proposal for RegExp.escape at https://github.com/benjamingr/RexExp.escape/, with a polyfill available at https://github.com/ljharb/regexp.escape.
Nothing should prevent you from just escaping every non-alphanumeric character:
usersString.replace(/(?=\W)/g, '\\');
You lose a certain degree of readability when doing re.toString() but you win a great deal of simplicity (and security).
According to ECMA-262, on the one hand, regular expression "syntax characters" are always non-alphanumeric, such that the result is secure, and special escape sequences (\d, \w, \n) are always alphanumeric such that no false control escapes will be produced.
There is an ES7 proposal for RegExp.escape at https://github.com/benjamingr/RexExp.escape/, with a polyfill available at https://github.com/ljharb/regexp.escape.
An example based on the rejected ES proposal, includes checks if the property already exists, in the case that TC39 backtracks on their decision.
Code:
if (!Object.prototype.hasOwnProperty.call(RegExp, 'escape')) {
RegExp.escape = function(string) {
// https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide/Regular_Expressions#Escaping
// https://github.com/benjamingr/RegExp.escape/issues/37
return string.replace(/[.*+\-?^${}()|[\]\\]/g, '\\$&'); // $& means the whole matched string
};
}
Code Minified:
Object.prototype.hasOwnProperty.call(RegExp,"escape")||(RegExp.escape=function(e){return e.replace(/[.*+\-?^${}()|[\]\\]/g,"\\$&")});
// ...
var assert = require('assert');
var str = 'hello. how are you?';
var regex = new RegExp(RegExp.escape(str), 'g');
assert.equal(String(regex), '/hello\. how are you\?/g');
There is also an npm module at:
https://www.npmjs.com/package/regexp.escape
One can install this and use it as so:
npm install regexp.escape
or
yarn add regexp.escape
var escape = require('regexp.escape');
var assert = require('assert');
var str = 'hello. how are you?';
var regex = new RegExp(escape(str), 'g');
assert.equal(String(regex), '/hello\. how are you\?/g');
In the GitHub && NPM page are descriptions of how to use the shim/polyfill for this option, as well. That logic is based on return RegExp.escape || implementation;, where implementation contains the regexp used above.
The NPM module is an extra dependency, but it also make it easier for an external contributor to identify logical parts added to the code. ¯\(ツ)/¯
Another (much safer) approach is to escape all the characters (and not just a few special ones that we currently know) using the unicode escape format \u{code}:
function escapeRegExp(text) {
return Array.from(text)
.map(char => `\\u{${char.charCodeAt(0).toString(16)}}`)
.join('');
}
console.log(escapeRegExp('a.b')); // '\u{61}\u{2e}\u{62}'
Please note that you need to pass the u flag for this method to work:
var expression = new RegExp(escapeRegExp(usersString), 'u');
This is a shorter version.
RegExp.escape = function(s) {
return s.replace(/[$-\/?[-^{|}]/g, '\\$&');
}
This includes the non-meta characters of %, &, ', and ,, but the JavaScript RegExp specification allows this.
XRegExp has an escape function:
XRegExp.escape('Escaped? <.>');
// -> 'Escaped\?\ <\.>'
More on: http://xregexp.com/api/#escape
escapeRegExp = function(str) {
if (str == null) return '';
return String(str).replace(/([.*+?^=!:${}()|[\]\/\\])/g, '\\$1');
};
Rather than only escaping characters which will cause issues in your regular expression (e.g.: a blacklist), consider using a whitelist instead. This way each character is considered tainted unless it matches.
For this example, assume the following expression:
RegExp.escape('be || ! be');
This whitelists letters, number and spaces:
RegExp.escape = function (string) {
return string.replace(/([^\w\d\s])/gi, '\\$1');
}
Returns:
"be \|\| \! be"
This may escape characters which do not need to be escaped, but this doesn't hinder your expression (maybe some minor time penalties - but it's worth it for safety).
The functions in the other answers are overkill for escaping entire regular expressions (they may be useful for escaping parts of regular expressions that will later be concatenated into bigger regexps).
If you escape an entire regexp and are done with it, quoting the metacharacters that are either standalone (., ?, +, *, ^, $, |, \) or start something ((, [, {) is all you need:
String.prototype.regexEscape = function regexEscape() {
return this.replace(/[.?+*^$|({[\\]/g, '\\$&');
};
And yes, it's disappointing that JavaScript doesn't have a function like this built-in.
I borrowed bobince's answer above and created a tagged template function for creating a RegExp where part of the value is escaped and part isn't.
regex-escaped.js
RegExp.escape = text => text.replace(/[\-\[\]{}()*+?.,\\\^$|#\s]/g, '\\$&');
RegExp.escaped = flags =>
function (regexStrings, ...escaped) {
const source = regexStrings
.map((s, i) =>
// escaped[i] will be undefined for the last value of s
escaped[i] === undefined
? s
: s + RegExp.escape(escaped[i].toString())
)
.join('');
return new RegExp(source, flags);
};
function capitalizeFirstUserInputCaseInsensitiveMatch(text, userInput) {
const [, before, match, after ] =
RegExp.escaped('i')`^((?:(?!${userInput}).)*)(${userInput})?(.*)$`.exec(text);
return `${before}${match.toUpperCase()}${after}`;
}
const text = 'hello (world)';
const userInput = 'lo (wor';
console.log(capitalizeFirstUserInputCaseInsensitiveMatch(text, userInput));
For our TypeScript fans...
global.d.ts
interface RegExpConstructor {
/** Escapes a string so that it can be used as a literal within a `RegExp`. */
escape(text: string): string;
/**
* Returns a tagged template function that creates `RegExp` with its template values escaped.
*
* This can be useful when using a `RegExp` to search with user input.
*
* #param flags The flags to apply to the `RegExp`.
*
* #example
*
* function capitalizeFirstUserInputCaseInsensitiveMatch(text: string, userInput: string) {
* const [, before, match, after ] =
* RegExp.escaped('i')`^((?:(?!${userInput}).)*)(${userInput})?(.*)$`.exec(text);
*
* return `${before}${match.toUpperCase()}${after}`;
* }
*/
escaped(flags?: string): (regexStrings: TemplateStringsArray, ...escapedVals: Array<string | number>) => RegExp;
}
There has only ever been and ever will be 12 meta characters that need to be escaped
to be considered a literal.
It doesn't matter what is done with the escaped string, inserted into a balanced regex wrapper or appended. It doesn't matter.
Do a string replace using this
var escaped_string = oldstring.replace(/[\\^$.|?*+()[{]/g, '\\$&');
This one is the permanent solution.
function regExpEscapeFuture(literal_string) {
return literal_string.replace(/[^A-Za-z0-9_]/g, '\\$&');
}
Just published a regex escape gist based on the RegExp.escape shim which was in turn based on the rejected RegExp.escape proposal. Looks roughly equivalent to the accepted answer except it doesn't escape - characters, which seems to be actually fine according to my manual testing.
Current gist at the time of writing this:
const syntaxChars = /[\^$\\.*+?()[\]{}|]/g
/**
* Escapes all special special regex characters in a given string
* so that it can be passed to `new RegExp(escaped, ...)` to match all given
* characters literally.
*
* inspired by https://github.com/es-shims/regexp.escape/blob/master/implementation.js
*
* #param {string} s
*/
export function escape(s) {
return s.replace(syntaxChars, '\\$&')
}

Regular expression with backslash in Python3

I'm trying to match a specific substring in one string with regular expression, like matching "\ue04a" in "\ue04a abc". But something seems to be wrong. Here's my code:
m = re.match('\\([ue]+\d+[a-z]+)', "\ue04a abc").
The returned m is an empty object, even I tried using three backslashes in the pattern. What's wrong?
Backslashes in regular expressions in Python are extremely tricky. With regular strings (single or triple quotes) there are two passes of backslash interpretation: first, Python itself interprets backslashes (so "\t" represents a single character, a literal tab) and then the result is passed to the regular expression engine, which has its own semantics for any remaining backslashes.
Generally, using r"\t" is strongly recommended, because this removes the Python string parsing aspect. This string, with the r prefix, undergoes no interpretation by Python -- every character in the string simply represents itself, including backslash. So this particular example represents a string of length two, containing the literal characters backslash \ and t.
It's not clear from your question whether the target string "\ue04a abc" should be interpreted as a string of length five containing the Unicode character U+E04A (which is in the Private Use Area, aka PUA, meaning it doesn't have any specific standard use) followed by space, a, b, c -- in which case you would use something like
m = re.match(r'[\ue000-\uf8ff]', "\ue04a abc")
to capture any single code point in the traditional Basic Multilingual Plane PUA; -- or if you want to match a literal string which begins with the two characters backslash \ and u, followed by four hex digits:
m = re.match(r'\\u[0-9a-fA-F]{4}', r"\ue04a abc")
where the former is how Python (and hence most Python programmers) would understand your question, but both interpretations are plausible.
The above show how to match the "mystery sequence" "\ue04a"; it should not then be hard to extend the code to match a longer string containing this sequence.
This should help.
import re
m = re.match(r'(\\ue\d+[a-z]+)', r"\ue04a abc")
if m:
print( m.group() )
Output:
\ue04a

Escaping invalid markdown using python regex

I've been trying to write some python to escape 'invalid' markdown strings.
This is for use with a python library (python-telegram-bot) which requires unused markdown characters to be escaped with a \.
My aim is to match lone *,_,` characters, as well as invalid hyperlinks - eg, if no link is provided, and escape them.
An example of what I'm looking for is:
*hello* is fine and should not be changed, whereas hello* would become hello\*. On top of that, if values are nested, they should not be escaped - eg _hello*_ should remain unchanged.
My thought was to match all the doubles first, and then replace any leftover lonely characters. I managed a rough version of this using re.finditer():
def parser(txt):
match_md = r'(\*)(.+?)(\*)|(\_)(.+?)(\_)|(`)(.+?)(`)|(\[.+?\])(\(.+?\))|(?P<astx>\*)|(?P<bctck>`)|(?P<undes>_)|(?P<sqbrkt>\[)'
for e in re.finditer(match_md, txt):
if e.group('astx') or e.group('bctck') or e.group('undes') or e.group('sqbrkt'):
txt = txt[:e.start()] + '\\' + txt[e.start():]
return txt
note: regex was written to match *text*, _text_, `text`, [text](url), and then single *, _, `, [, knowing the last groups
But the issue here, is of course that the offset changes as you insert more characters, so everything shifts away. Surely there's a better way to do this than adding an offset counter?
I tried to use re.sub(), but I haven't been able to find how to replace a specific group, or had any luck with (?:) to 'not match' the valid markdown.
This was my re.sub attempt:
def test(txt):
match_md = r'(?:(\*)(.+?)(\*))|' \
'(?:(\_)(.+?)(\_))|' \
'(?:(`)(.+?)(`))|' \
'(?:(\[.+?\])(\(.+?\)))|' \
'(\*)|' \
'(`)|' \
'(_)|' \
'(\[)'
return re.sub(match_md, "\\\\\g<0>", txt)
This just prefixed every match with a backslash (which was expected, but I'd hoped the ?: would stop them being matched.)
Bonus would be if \'s already in the string were escaped too, so that they wouldn't interfere with the markdown present - this could be a source of error, as the library would see it as escaped, causing it see the rest as invalid.
Thanks in advance!
You are probably looking for a regular expression like this:
def test(txt):
match_md = r'((([_*]).+?\3[^_*]*)*)([_*])'
return re.sub(match_md, "\g<1>\\\\\g<4>", txt)
Note that for clarity I just made up a sample for * and _. You can expand the list in the [] brackets easily. Now let's take a look at this thing.
The idea is to crunch through strings that look like *foo_* or _bar*_ followed by text that doesn't contain any specials. The regex that matches such a string is ([_*]).+?\1[^_*]*: We match an opening delimiter, save it in \1, and go further along the line until we see the same delimiter (now closing). Then we eat anything behind that that doesn't contain any delimiters.
Now we want to do that as long as no more delimited strings remain, that's done with (([_*]).+?\2[^_*]*)*. What's left on the right side now, if anything, is an isolated special, and that's what we need to mask. After the match we have the following sub matches:
g<0> : the whole match
g<1> : submatch of ((([_*]).+?\3[^_*]*)*)
g<2> : submatch of (([_*]).+?\3[^_*]*)
g<3> : submatch of ([_*]) (hence the \3 above)
g<4> : submatch of ([_*]) (the one to mask)
What's left to you now is to find a way how to treat the invalid hyperlinks, that's another topic.
Update:
Unfortunately this solution masks out valid markdown such as *hello* (=> \*hello\*). The work around to fix this would be to add a special char to the end of line and remove the masked special char once the substitution is done. OP might be looking for a better solution.

Is there an R function to escape a string for regex characters

I'm wanting to build a regex expression substituting in some strings to search for, and so these string need to be escaped before I can put them in the regex, so that if the searched for string contains regex characters it still works.
Some languages have functions that will do this for you (e.g. python re.escape: https://stackoverflow.com/a/10013356/1900520). Does R have such a function?
For example (made up function):
x = "foo[bar]"
y = escape(x) # y should now be "foo\\[bar\\]"
I've written an R version of Perl's quotemeta function:
library(stringr)
quotemeta <- function(string) {
str_replace_all(string, "(\\W)", "\\\\\\1")
}
I always use the perl flavor of regexps, so this works for me. I don't know whether it works for the "normal" regexps in R.
Edit: I found the source explaining why this works. It's in the Quoting Metacharacters section of the perlre manpage:
This was once used in a common idiom to disable or quote the special meanings of regular expression metacharacters in a string that you want to use for a pattern. Simply quote all non-"word" characters:
$pattern =~ s/(\W)/\\$1/g;
As you can see, the R code above is a direct translation of this same substitution (after a trip through backslash hell). The manpage also says (emphasis mine):
Unlike some other regular expression languages, there are no backslashed symbols that aren't alphanumeric.
which reinforces my point that this solution is only guaranteed for PCRE.
Apparently there is a function called escapeRegex in the Hmisc package. The function itself has the following definition for an input value of 'string':
gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", string)
My previous answer:
I'm not sure if there is a built in function but you could make one to do what you want. This basically just creates a vector of the values you want to replace and a vector of what you want to replace them with and then loops through those making the necessary replacements.
re.escape <- function(strings){
vals <- c("\\\\", "\\[", "\\]", "\\(", "\\)",
"\\{", "\\}", "\\^", "\\$","\\*",
"\\+", "\\?", "\\.", "\\|")
replace.vals <- paste0("\\\\", vals)
for(i in seq_along(vals)){
strings <- gsub(vals[i], replace.vals[i], strings)
}
strings
}
Some output
> test.strings <- c("What the $^&(){}.*|?", "foo[bar]")
> re.escape(test.strings)
[1] "What the \\$\\^&\\(\\)\\{\\}\\.\\*\\|\\?"
[2] "foo\\[bar\\]"
An easier way than #ryanthompson function is to simply prepend \\Q and postfix \\E to your string. See the help file ?base::regex.
Use the rex package
These days, I write all my regular expressions using rex. For your specific example, rex does exactly what you want:
library(rex)
library(assertthat)
x = "foo[bar]"
y = rex(x)
assert_that(y == "foo\\[bar\\]")
But of course, rex does a lot more than that. The question mentions building a regex, and that's exactly what rex is designed for. For example, suppose we wanted to match the exact string in x, with nothing before or after:
x = "foo[bar]"
y = rex(start, x, end)
Now y is ^foo\[bar\]$ and will only match the exact string contained in x.
According to ?regex:
The symbol \w matches a ‘word’ character (a synonym for [[:alnum:]_], an extension) and \W is its negation ([^[:alnum:]_]).
Therefore, using capture groups, (\\W), we can detect the occurrences of non-word characters and escape it with the \\1-syntax:
> gsub("(\\W)", "\\\\\\1", "[](){}.|^+$*?\\These are words")
[1] "\\[\\]\\(\\)\\{\\}\\.\\|\\^\\+\\$\\*\\?\\\\These\\ are\\ words"
Or similarly, replacing "([^[:alnum:]_])" for "(\\W)".