Invalid regular expression - Invalid property name in character class - regex

I am using a fastify server with a TypeScript file that calls a function which makes sure people can't send unwanted characters. Here is the function:
const SAFE_STRING_REPLACE_REGEXP = /[^\p{Latin}\p{Zs}\p{M}\p{Nd}\-\'\s]/gu;
function secure(text: string) {
  return text.replace(SAFE_STRING_REPLACE_REGEXP, "").trim();
}
But when I try to launch my server, I get this error message:
"Invalid regular expression - Invalid property name in character class".
It used to work just fine with my previous regex :
const SAFE_STRING_REPLACE_REGEXP = /[^0-9a-zA-ZàáâäãåąčćęèéêëėįìíîïłńòóôöõøùúûüųūÿýżźñçčšžÀÁÂÄÃÅĄĆČĖĘÈÉÊËÌÍÎÏĮŁŃÒÓÔÖÕØÙÚÛÜŲŪŸÝŻŹÑßÇŒÆČŠŽ∂ð\-\s\']/g;
function secure(text: string) {
  return text.replace(SAFE_STRING_REPLACE_REGEXP, "").trim();
}
But I have been told it wasn't optimized enough. I have also been told that split/join performs better than regex/replace, but I don't know whether I can use it in my case.

You need to use
const SAFE_STRING_REPLACE_REGEXP = /[^\p{Script=Latin}\p{Zs}\p{M}\p{Nd}'\s-]/gu;
// or
const SAFE_STRING_REPLACE_REGEXP = /[^\p{sc=Latin}\p{Zs}\p{M}\p{Nd}'\s-]/gu;
You need to prefix scripts with sc= or Script= in Unicode category classes, so \p{Latin} should be specified as \p{Script=Latin}. See the ECMAScript reference.
Also, when you use the u flag, you cannot escape non-special chars, so do not escape ', and it is better to move the - char to the end of the character class, where it needs no escaping.
You can use split&join, too:
const SAFE_STRING_REPLACE_REGEXP = /[^\p{Script=Latin}\p{Zs}\p{M}\p{Nd}'\s-]/u;
console.log("Ącki-Łał русский!!!中国".split(SAFE_STRING_REPLACE_REGEXP).join(""))
Note that you don't need the g modifier with split: splitting at every match is its default behavior.

Related

Efficient C++ way of giving literal meaning to special symbols (") in a C++ string

I want to put this:
<script src = "Script2.js" type = "text/javascript"> < / script>
in a std::string so I append a (\) symbol before every double quotes (") to give it a literal meaning of ", instead of a string demarcation in C++ like this:
std::string jsFilesImport = "<script src = \"Script2.js\" type = \"text/javascript\"> < / script>";
If I have a big string with many ("), adding (\) for every (") becomes difficult. Is there a simple way to achieve this in C++?
The easiest way is to use a raw string literal:
std::string s = R"x(<script src = "Script2.js" type = "text/javascript"> < / script>)x";
//              ^^^^                                                                ^^^
You just need to take care that the closing delimiter )x" doesn't appear in the text itself. All other characters between the x( and )x delimiters are taken as is (including newlines). The x is chosen arbitrarily; the delimiters could just as well be something( and )something.
To prevent further questions regarding this:
No, it's not possible to do (1) something like
std::string s = R"resource(
#include "MyRawTextResource.txt"
)resource";
The preprocessor won't recognize the #include "MyRawTextResource.txt" directive, because it sits inside a string literal.
For such cases, consider introducing a custom pre-build step that does the text replacement before compilation.
(1) At least I wasn't able to find a workaround. I'd appreciate it if someone proves me wrong about that.

Why is the flex regex being skipped?

I can't, for the life of me, figure out what's wrong with my regex's.
What I'd like to tokenize are two (2) types of strings, each of which must fit on a single line. One string can be anything (other than a new line), and the other can contain any alphanumeric (ASCII) character plus the literal characters '_', '/', '-', and '.'.
The snippet of flex code is:
nl \n|\r\n|\r|\f|\n\r
...
%%
...
\"[^\"]+{nl} { frx_parser_error("Label is missing trailing double quote."); }
\"[a-zA-Z0-9_\.\/\-]+\" {
    if (yyleng > 1024) frx_parser_error("File name too long.");
    yytext[yyleng - 1] = '\0';
    frx_parser_lval.str = strdup(yytext+1);
    fprintf(stderr,"TOSP_FILENAME: %s\n", frx_parser_lval.str);
    return (TOSP_FILENAME);
}
\"[^{nl}]+\" {
    yytext[yyleng - 1] = '\0';
    frx_parser_lval.str = strdup(yytext+1);
    fprintf(stderr,"TOSP_IDENTIFIER:\n%s\n", frx_parser_lval.str);
    return (TOSP_IDENTIFIER);
}
And when I run the parser, the fprintf's spit this out:
TOSP_FILENAME: ModStar-Picture-Analysis.txt
TOSP_FILENAME: ModStar-Rubric.log.txt
TOSP_IDENTIFIER:
picture-A"
Progress (26,255) camera 'C' root("picture-C-
Syntax (line 34): syntax error
For whatever reason, the quote after picture-A is being ... missed. Why? I checked the byte values at the eight locations where the double quote character appears, and they're all 0x22.
If I add some characters to the end of the "picture-A" it can work sometimes; adding ".par", ".pbr" doesn't work as expected, but ".pnr" does.
I've even added a specific non-regexy token:
\"picture-A\" { frx_parser_lval.str = strdup("picture-A"); return TOSP_FILENAME; }
to the lex file and it gets skipped.
I'm using flex 2.5.39, no flex libraries, one option (%option prefix=frx_parser_) in the lex file and the flex command line is:
flex -t script-lexer.l > script-lexer.c
What gives?
EDIT I need to test this on the actual system, but unit tests show this tokenizer to be much more robust (based on rici's answer):
nl \n|\r\n|\r|\f|\n\r
...
%%
...
["][^"]+{nl} { printf("Missing trailing quote.\n%s\n",yytext); }
["][[:alnum:]_./-]+["] { printf("File name:\n%s\n",yytext); }
["][^"]+["] { printf("String:\n%s\n",yytext); }
EDIT The rule ["].+["] swallows consecutive multiple strings as one big string. It was changed to ["][^"]+["]
The problem is your pattern:
\"[^{nl}]+\"
You're attempting to expand a definition inside a character class, but that is not possible; inside a character class, { is always just a {, not a flex operator. See the flex manual:
Note that inside of a character class, all regular expression operators lose their special meaning except escape (‘\’) and the character class operators, ‘-’, ‘]]’, and, at the beginning of the class, ‘^’.
A definition is not a macro. Rather, a definition defines a new regular expression operator.
As a consequence of the above, you can write [^\"] as simply [^"] and \"[a-zA-Z0-9_\.\/\-]+\" as \"[a-zA-Z0-9_./-]+\" (The - needs to be either at the end or at the beginning.) Personally, I'd write the second pattern as:
["][[:alnum:]_./-]+["]
But everyone has their own style.
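To see what the bracket expression really excludes, the same pattern can be tried in Python's re module. This is only an analogy and not flex itself: re has no definitions and chooses between rules differently, but it also treats { and } inside brackets as literal characters. The input text is hypothetical, loosely modelled on the output shown in the question.

```python
import re

# Inside brackets, "{nl}" is not expanded: the class excludes only the four
# literal characters "{", "n", "l" and "}". Newlines and quotes are allowed.
broken = re.compile(r'"[^{nl}]+"')
# What the question intended for a generic string: anything but a quote.
fixed = re.compile(r'"[^"]+"')

# Hypothetical two-line input; it contains no "n" or "l" anywhere.
text = '"picture-A"\nProgress root("picture-C")'

broken_match = broken.match(text).group(0)
fixed_match = fixed.match(text).group(0)
# A name containing an "n" stops the broken class before the closing quote.
with_n = broken.match('"picture-A.pnr"')

print(repr(broken_match))
print(repr(fixed_match))
print(with_n)
```

The broken pattern runs across the newline up to the last quote it can reach, while a name containing n or l cannot match it at all. That would be consistent with the question's observation that ".pnr" worked while ".par" and ".pbr" did not.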

Return value of std.regex.regex?

I'm trying to write a function that takes an input string, a regex (made by std.regex.regex from a rawstring) and an error message string, and attempt to match something from the input string using the regex, displaying the error message if there are no matches. I came up with the following signature so far:
string check_for_match (string input, Regex r, string error_message)
However, this doesn't seem to work, as the compiler complains, saying:
struct std.regex.Regex(Char) is used as a type
So what should I use instead?
It'll compile if you change Regex to Regex!char.
The reason is that Regex is a template that can use any character size: char for UTF-8 patterns, wchar for UTF-16, or dchar for UTF-32. The compiler is telling you that Regex on its own is a template, not a type; you get a usable type by instantiating it with the required Char argument.
Since you are working with string, which is made up of chars, Regex!char is the type to use.
string check_for_match (string input, Regex!char r, string error_message) { return null; }

How to create conditional if statement based on value + wildcard in Python?

I have a string that may be either:
my_string = "part1"
or:
my_string = "part1/part2"
I need to handle each of the above scenarios conditionally ie (pseudo code):
if my_string = "part1/" + *:
# do this
where * could be any value.
Once I can catch this condition, I will split my_string and assign the second part of the path to a new variable ie:
my_new_string = my_string.split("/")[1]
Is it possible to set up this sort of 'wildcard'?
Edit:
Actually, I just realised I could probably do something like:
if "/" in my_string:
    my_new_string = my_string.split("/")[1]
I'd still be interested to know about whether such a 'wildcard' operation exists.
Well, you can always use regular expressions to match the condition; see http://docs.python.org/2/library/re.html#re.match
re.match(r'part1/.+', my_string)
Note the + instead of the *: it makes sure that at least one character follows the /.
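A small sketch of how this could be wrapped up, assuming the prefix is always the literal part1/ from the question. The helper name second_part is made up for illustration. The standard-library fnmatch module is shown alongside it because its shell-style * is the closest thing to the wildcard in the pseudocode.

```python
import re
from fnmatch import fnmatchcase


def second_part(my_string):
    """Return the text after "part1/", or None when nothing follows it."""
    match = re.match(r"part1/(.+)", my_string)
    return match.group(1) if match else None


print(second_part("part1/part2"))
print(second_part("part1"))

# Shell-style wildcard: "*" matches any run of characters, including none.
print(fnmatchcase("part1/part2", "part1/*"))
print(fnmatchcase("part1/", "part1/*"))
```

One difference to keep in mind: the fnmatch * also matches an empty string, so "part1/" counts as a match there, whereas the regex with .+ requires something after the slash.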

Regular Expression for removing suffix

What is the regular expression for removing the suffix of file names? For example, if I have a file name in a string such as "vnb.txt", what is the regular expression to remove ".txt"?
Thanks.
Do you really need a regular expression to do this? Why not just look for the last period in the string, and trim the string up to that point? Frankly, there's a lot of overhead for a regular expression, and I don't think you need it in this case.
As suggested by tstenner, you can try one of the following, depending on what kinds of strings you're using:
std::strrchr
std::string::find_last_of
First example:
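#include <cstring>

const char* str = "Directory/file.txt";
size_t index = 0;
const char* pStr = std::strrchr(str, '.'); // last '.' in the string, or nullptr
if (nullptr != pStr)
{
    index = pStr - str; // offset of the '.' that starts the suffix
}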
Second example:
std::size_t index = std::string("Directory/file.txt").find_last_of('.'); // std::string::npos if there is no '.'
If you are using Qt already, you could use QFileInfo, and use the baseName() function to get just the name (if one exists), or the suffix() function to get the extension (if one exists).
If you're looking for a solution that will give you anything except for the suffix, you should use string::find_last_of.
Your code could look like this:
const std::string removesuffix(const std::string& s) {
    size_t suffixbegin = s.find_last_of('.');
    // This will handle cases like "directory.foo/bar"
    size_t dir = s.find_last_of('/');
    if (dir != std::string::npos && dir > suffixbegin) return s;
    if (suffixbegin == std::string::npos) return s;
    else return s.substr(0, suffixbegin);
}
If you're looking for a regular expression, use \.[^.]+$.
You have to escape the first ., otherwise it will match any character, and put a $ at the end, so it will only match at the end of a string.
Different operating systems may allow different characters in filenames, so the simplest regex might be (.+)\.txt$. Take the first capture group to get the filename without the extension.
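For a quick check of how the two patterns behave, they can be tried in Python's re module. Python is used here only as a convenient way to exercise the regexes, and the function name strip_suffix is hypothetical.

```python
import re


def strip_suffix(name):
    """Remove the last ".something" from name, if there is one."""
    return re.sub(r"\.[^.]+$", "", name)


print(strip_suffix("vnb.txt"))
print(strip_suffix("archive.tar.gz"))
print(strip_suffix("README"))
# Caveat: the pattern does not know about directories.
print(strip_suffix("directory.foo/bar"))

# The capture-group variant only handles one known extension.
txt_only = re.match(r"(.+)\.txt$", "vnb.txt").group(1)
print(txt_only)
```

The plain pattern also strips across a slash, so directory.foo/bar becomes directory; the find_last_of version above guards against exactly that case.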