Escape regex special characters for tr1::regex - c++

I need to embed user-input in my regular expression, so it needs to be escaped for any regex special characters, and I don't know in advance what the string will be.
It would be something like
string pattern = "\\d+ " + myEscapeFunction(userData);
Which special characters do I need to escape? Or is there an equivalent function to Qt's QRegExp::escape?

The list of characters that you have to escape depends on which of the various regular expression grammars you're using. If you're using the default ECMAScript, it looks like the list in the QRegExp::escape documentation is a good place to start. It says:
The special characters are $, (, ), *, +, ., ?, [, ], ^, {, | and }.
That list leaves out \ for some reason.
But it's slightly more complicated than that, because inside square brackets the only special characters are \, ], ^ (as the first character) and - (between two characters), and \] has to stay unescaped.
Further, a ? that comes right after a ( is not special. For example, in (?=x) the ? should not be escaped.
I think that's pretty much it, but I haven't put enough time into this to be sure.
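For what it's worth, here is a minimal sketch of such an escape function for the default ECMAScript grammar. It assumes std::regex from C++11 rather than tr1::regex, and that the user data is dropped into the pattern outside of any square brackets. myEscapeFunction is the hypothetical name from the question, and matchesNumberThen is made up purely for illustration.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (name taken from the question): puts a backslash in
// front of every ECMAScript metacharacter found in `text`.
std::string myEscapeFunction(const std::string& text) {
    // The list from the QRegExp::escape docs, plus the backslash itself.
    static const std::string special = "\\^$.|?*+()[]{}";
    std::string escaped;
    escaped.reserve(text.size() * 2);
    for (char c : text) {
        if (special.find(c) != std::string::npos) {
            escaped += '\\';
        }
        escaped += c;
    }
    return escaped;
}

// Illustrative only: builds the pattern from the question and tests
// whether the whole of `subject` matches it.
bool matchesNumberThen(const std::string& userData, const std::string& subject) {
    const std::regex pattern("\\d+ " + myEscapeFunction(userData));
    return std::regex_match(subject, pattern);
}
```

With userData set to a+b (c), the subject 42 a+b (c) matches but 42 aab (c) does not, which is the point of escaping the + in the first place.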

Related

Problem replacing characters in a Regular Expression

I'm just starting to get to grips with Regular Expressions. My first task is to remove all the characters in a string except a-z (upper and lower case), 0-9, and the characters - \ . : and ,
So I tried
objInstance.mystring.replaceAll("[^A-Za-z0-9\\- .:,]", "")
However, this still removes the hyphen and the backslash.
I suspect it's the placement of the \, but some guidance would be helpful here.
You need to escape the backslash as well as the hyphen. These characters have a meaning in the regex, so you need to escape them to match the actual characters.
[^A-Za-z0-9\\\-.:,] should be the correct regex. There's also a space in yours; there's no mention of it in your question, so I removed it. Keep the ^ at the start of the class: inside square brackets it does not mean the start of the string, it negates the class, which is what you need in order to remove everything except the listed characters. In a Java string literal each backslash has to be doubled again, giving "[^A-Za-z0-9\\\\\\-.:,]".
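To make the escaping inside the character class concrete, here is a small sketch of the same clean-up using C++ std::regex with its default ECMAScript grammar. That choice is an assumption on my part, since the question itself is about Java, and keepAllowed is a made-up name. A raw string literal is used so the pattern appears exactly as the regex engine sees it.

```cpp
#include <regex>
#include <string>

// Removes every character that is not a letter, a digit,
// or one of the characters - \ . : ,
std::string keepAllowed(const std::string& input) {
    // Inside the class: \\ is a literal backslash, \- is a literal hyphen.
    // The leading ^ negates the class, so everything else gets matched.
    static const std::regex disallowed(R"([^A-Za-z0-9\\\-.:,])");
    return std::regex_replace(input, disallowed, "");
}
```

Without the raw string literal, every backslash in the pattern would have to be doubled once more for the C++ compiler, just as in Java.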

Regular expression: page path starts with "/posts/" and ends with ".html"

I'm stuck here:
=~^/posts/(*).html
but it doesn't work
I need something that can recognise something like this:
/posts/testing.html
/posts/another-testing-issue.html
And I'm not very good using RegEx
Can anyone help me please?
EDIT:
Floris had the right answer:
^/posts/.*html$
thank you!
Briefly, the expression you need is
^\/posts\/.*\.html$
Explanation:
^ start of string
\/posts\/ literal string '/posts/'
the backslash "protects" the forward slash -
it is called "escaping", and removes any special meaning it might have
(in some applications the / would be a delimiter)
.* any number of characters
\. literal '.'
html literal 'html'
$ end of string
Now for a bit more background on regex syntax…
As @Peter points out in the comment, a quantifier follows "the thing to quantify". In most (all?) regex syntaxes, writing (*) will generate the error "preceding token is not quantifiable". You need something in front of the *, and a ( doesn't count (unless it was escaped).
This is where the dot comes in. The dot . means "any character at all". That is its usual meaning, which is why .* is just about the most common thing in regular expressions, meaning "I don't care about the next bit…" (usually up to an "until" - whatever follows).
Because the dot has a special meaning, when you want the exact string .html, you need to write it as \.html (there's that escape backslash again to remove the special meaning from the dot).
As a final tweak, it is not uncommon to have an extension like .htm - so you could write your expression as
\/posts\/.*\.html?$
This would make the last character, the l, optional (the ? means "zero or one times the preceding expression", which in this case is the single character immediately before it).
You can see this at work at http://regex101.com/r/bK5yC7 - it is a wonderful tool for exploring regular expressions, and also gives a nice explanation (breakdown) of every expression you type (with highlighting of any errors)
You missed the dot that matches any single character, and you didn't escape the second one to make it literal:
^/posts/(.*)\.html
In most regular expression flavors, . means any character and * means any number of repetitions of the preceding item, so try changing it to
^/posts/(.*)\.html
\ is the escape character.
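As a quick way to try the expressions from these answers, here is a sketch using C++ std::regex, which is an assumption since the question doesn't say which tool the pattern is for. isPostPage is a made-up name. The forward slashes are left unescaped because / is not a delimiter here.

```cpp
#include <regex>
#include <string>

// True when `path` starts with "/posts/" and ends with ".html" or ".htm".
bool isPostPage(const std::string& path) {
    // ^ and $ anchor both ends; \. is a literal dot; l? makes the l optional.
    static const std::regex pattern(R"(^/posts/.*\.html?$)");
    return std::regex_search(path, pattern);
}
```

Both sample paths from the question are accepted, while /posts/indexhtml is rejected because the escaped dot has to be present literally.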

What's the meaning of this perl regex expression?

the regex expression is as below:
if ($ftxt =~ m|/([^=]+)="(.+)"|o)
{
.....
}
This regex seems different from many others. What confuses me is the "|": most regexes use "/" instead. The group ([^=]+) also confuses me. I know [^=] means "the start of the string" or "=", but what does it mean to repeat '^' one or more times? How can this be explained?
You can use different delimiters instead of /. For instance you could use:
m#/([^=]+)="(.+)"#o
Or
m~/([^=]+)="(.+)"~o
The advantage here of using something different than / is that you don't have to escape slashes, because otherwise, you'd have to use:
m/\/([^=]+)="(.+)"/o
(The \/ could also be written as [/].)
([^=]+) is a capture group, and inside, you have [^=]+. [^=] is a negated class and will match any character which is not a =.
^ behaves differently at the beginning of a character class and is not the same as ^ outside a character class which means 'beginning of line'.
As for the last part o, this is a flag which I haven't met so far so a little search brought me to this post, I quote:
The /o modifier is in the perlop documentation instead of the perlre documentation since it is a quote-like modifier rather than a regex modifier. That has always seemed odd to me, but that's how it is.
Before Perl 5.6, Perl would recompile the regex even if the variable had not changed. You don't need to do that anymore. You could use /o to compile the regex once despite further changes to the variable, but as the other answers noted, qr// is better for that.
Some regexp implementations allow you to use other special characters besides / as the delimiter. This is useful if you need to use that special character inside the regular expression itself, since you don't have to escape it. (In and of itself / is not a special character in regexp syntax, but it needs escaping if it's used in the regexp literal syntax of the host language.) The docs on Perl's quote operators mention this.
This is tutorial-level stuff: square brackets ([abc]) denote a character class - it means "any of the characters inside the brackets". (In my example, it means "either a or b or c".) Inside them, the ^ special character has a different meaning: it inverts the character class. So, [^=] means "any character except =", and [^=]+ means "one or more characters that aren't =".
Quoting the docs on Perl's RE syntax:
You can specify a character class, by enclosing a list of characters in [] , which will match any character from the list. If the first character after the "[" is "^", the class matches any character not in the list.
It is meant to match equation-like expressions and to capture the key and the value separately. Imagine you have a statement like height="30px", and you want to capture the height attribute name, as well as its value 30px.
So you have m|/([^=]+)="(.+)"|.
The key is supposed to be everything before the = is encountered, so [^=]+ captures it. The ^ is a negation metacharacter when used as the first character inside [] brackets. It means that it will match any character except =, which is what you want. The leading / is matched literally: because | is the delimiter here, a forward slash needs no escaping. The parentheses around [^=]+ form the capture group; to match a literal opening parenthesis instead, you would have to escape it as \(, since it is a special character.
Next comes the = sign, which you don't care about. Then come the quotes that contain the value, so you capture it with "(.+)". The .+ will go on greedily matching every character, including the final ". But then the regex can't match its own final ", so it backtracks and gives up the last " that (.+) captured, which leaves the string within the quotes in the group. Now you are ready to access the key and value through $1 and $2. Cool, isn't it?
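The same pattern can be tried outside Perl as well. The sketch below uses C++ std::regex, which is an assumption since the question is about Perl, and parseAttribute is a made-up name. The two capture groups play the role of $1 and $2.

```cpp
#include <regex>
#include <string>
#include <utility>

// Returns {key, value} for text containing something like /key="value".
// Both strings are empty when nothing matches.
std::pair<std::string, std::string> parseAttribute(const std::string& text) {
    // Same pattern as in the question; the custom raw-string delimiter "rx"
    // is needed because the pattern itself contains )" .
    static const std::regex pattern(R"rx(/([^=]+)="(.+)")rx");
    std::smatch m;
    if (std::regex_search(text, m, pattern)) {
        return {m[1].str(), m[2].str()};  // group 1 = key, group 2 = value
    }
    return {};
}
```

Because .+ is greedy, an input with several quoted values, such as /a="1" b="2", yields a value that runs up to the last quote on the line.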

Regular Expressions pattern with Special characters

I'm working on a regular expressions pattern, but it contains a number of special characters. I'm not really sure how to incorporate them in a normal regex pattern string. Specifically, I need to test to see if a string contains '+/-'...
I've tried using quotes etc but have no luck (I'm extremely new to regex). I am coding this in C# 4.0.
One string example is "3Z1Z +/- 5.5"
Any help is much appreciated - Thanks a lot!
Create a simple regex:
foundMatch = Regex.IsMatch(SubjectString, @"\+/-");
This returns true if the sequence of characters is found anywhere in your string. The explanation is left as an exercise to you.
Read more here.
These are part of the special character list (see also). Basically, add them to the pattern by prefixing them with a backslash (\). e.g. + becomes \+
^\+|\-$ # + or -
The same would go for anything else with special meaning, such as ., {, }, (, ), ^, $, |, [, ], etc.
There are some exceptions though. For instance, when creating a class such as: [a-z] the hyphen (-) would have special meaning (all letters from a through z). So if you wanted a literal hyphen you'd have to escape it (unless it falls as the last character of the class). e.g.
[a-z-A-Z] # hyphen should be escaped if you wanted a literal hyphen
[a-z\-A-Z] # the "correct" counter-part
[a-zA-Z-] # actually legal because it's inserted as the last character
# and therefore treated as a literal hyphen despite not being
# escaped.
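Both points, escaping the + and putting a literal hyphen last in a class, can be checked with a short sketch. It uses C++ std::regex rather than the C# of the question, which is an assumption made only to keep the examples in one language. The function names are made up.

```cpp
#include <regex>
#include <string>

// True when `text` contains the literal sequence "+/-".
bool containsPlusMinus(const std::string& text) {
    // Only + needs escaping; / and - are ordinary outside a character class.
    static const std::regex pattern(R"(\+/-)");
    return std::regex_search(text, pattern);
}

// True when `text` consists only of ASCII letters and hyphens.
bool lettersAndHyphens(const std::string& text) {
    // The hyphen comes last in the class, so it is taken literally.
    static const std::regex pattern("[a-zA-Z-]+");
    return std::regex_match(text, pattern);
}
```

The sample string from the question, 3Z1Z +/- 5.5, passes the first check.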

Is "#" a special character in regular expressions?

I am working on an email filter and I have come across a list of regular expressions that are used to block all emails coming from senders that match a record in that list. While browsing through the list, I have discovered that all occurrences of the # character are escaped with a \.
Does the # mean anything special in regular expressions and needs to be escaped like so \#?
It's normally not a special character, but it doesn't hurt to escape it which is probably why many people do it, they just want to be safe (or they think it's a special character).
No, the # is not a special character in regex.
The \ can, however, also be used to quote a whole run of characters literally:
Pattern:
\Q...\E
Definition:
Matches the characters between \Q and \E literally, suppressing the meaning of special characters.
Example:
\Q+-/\E matches +-/
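To see that escaping # makes no difference, here is a small sketch with C++ std::regex and its default ECMAScript grammar. This is an assumption, since the flavor used by the email filter isn't stated, and as far as I know that grammar has no \Q...\E, so only the plain and backslash-escaped forms are compared. found is a made-up helper.

```cpp
#include <regex>
#include <string>

// True when `text` contains a match for the ECMAScript pattern `pattern`.
bool found(const std::string& text, const std::string& pattern) {
    return std::regex_search(text, std::regex(pattern));
}
```

Calling found with the pattern #42 and with the pattern \#42 (written "\\#42" in C++ source) gives the same result for any input.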