Regular expressions and characters [duplicate]

This question already has answers here:
What special characters must be escaped in regular expressions?
(13 answers)
Closed 2 years ago.
Some characters, such as question marks and plus signs, have special meanings in regular expressions and must be preceded by a backslash if they are meant to represent the character itself.
What is the complete list of characters that must be preceded by a backslash?
Is it correct to say that all non-alphanumeric characters must be escaped?
And how do I add the backslashes to a PHP string? addslashes() only adds a backslash in these few cases:
single quote (')
double quote (")
backslash (\)
NUL (the NUL byte)

Actually, it depends. There are many flavors of regular expressions; the most common are:
BRE
ERE
PCRE (which itself has several variants across programming languages)
To match a metacharacter literally, escape it with \ as described in the documentation for your flavor; that's all.
Alternatively, enclose it in [], but that is usually overkill.
Also, you can embed any Unicode character in PCRE (and some other flavors) with the \x{FFFF} syntax, where
FFFF is the hexadecimal code point of the character.
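
As a minimal sketch of the two escaping options above, here is how they look in Python's re module. Python is an assumption made for illustration; the question is about PHP, where preg_quote() plays the role that re.escape() plays here. The sample string is made up.

```python
import re

literal = "1+1=2?"  # hypothetical text containing two metacharacters

# Unescaped, "+" and "?" act as quantifiers, so the pattern does not
# even match its own source text.
assert re.fullmatch(literal, literal) is None

# Option 1: let the library add the backslashes.
escaped = re.escape(literal)
assert re.fullmatch(escaped, literal) is not None

# Option 2: wrap each metacharacter in a one-character class.
bracketed = r"1[+]1=2[?]"
assert re.fullmatch(bracketed, literal) is not None
```

Letting the library do the escaping avoids having to memorize the metacharacter list for each flavor.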

Related

C++ regex for properly matching strings that contain c-style escape characters (ECMAScript style, no look-behind)

I'm a regex noob attempting to match either the contents or the entirety of a quoted segment of text without breaking on escaped quotation marks.
Put another way, I need a regex that, between two quotation marks, will match all characters that are not quotation marks and also any quotation mark that has an odd number of consecutive backslashes preceding it. It has to be an odd number of backslashes because a pair of backslashes escapes to a single backslash.
I've successfully created a regex that does this but it relied on look-behind and because this project is in C++ and because the regex implementation of standard C++ does not have look-behind functionality, I could not use said regex.
Here is the regex with look-behind that I came up with: "(((?<!\\)(\\\\)*\\"|[^"])*)"
The following text should produce 8 matches:
"Woah. Look. A tab."
"This \\\\\\\\\\\\\" is all one string"
"This \"\"\"\" is\" also\"\\ \' one\"\\\" string."
"These \\""are separate strings"
"The cat said,\"Yo.\""
"
\"Shouldn't it work on multiple lines?\" he asked rhetorically.
\"Of course it should.\"
"
"If you don't have exactly 8 matches, then you've failed."
Here's a link to my (probably naive) look-behind version of this example: https://regex101.com/r/uOxqWl/1
If this is impossible to do without look-behind, please let me know.
Also, if there is a well-regarded C++ regex library that allows regex look-behind, please let me know (It doesn't have to be ECMAScript, though I would slightly prefer that).

Let's derive a garden variety regular expression for C-style strings from an English description.
A string is a quotation mark, followed by a sequence of string-characters, followed by another quotation mark.
std::regex stringMatcher ( R"("<string-character>*")" );
Obviously this doesn't work as we didn't define the string-character yet. We can do so piece by piece.
Firstly, a string character could be any character except a quotation mark and a backslash.
R"([^\\"])"
Secondly, a string character could be an escape sequence consisting of a backslash and a single other character from a fixed set.
R"(\\[abfnrtv'"\\?])"
Thirdly, it can be an octal escape sequence that consists of a backslash and three octal digits
R"(\\[0-7][0-7][0-7])"
(We simplify here a bit because the real C standard allows 1, 2 or 3 octal digits. This is easy to add.)
Fourthly, it can be a hexadecimal escape sequence that consists of a backslash, a letter x, and a hexadecimal number. The range of the number is implementation defined, so we need to accept any one.
R"(\\x[0-9a-fA-F][0-9a-fA-F]*)"
We omit universal character names; they could be added in exactly the same way. There are none in the given test example.
So, to bring this all together:
std::regex stringMatcher ( R"("([^\\"]|\\([abfnrtv'"\\?]|[0-7][0-7][0-7]|x[0-9a-fA-F][0-9a-fA-F]*))*")" );
// collapsed the leading backslashes of all the escape sequence types together
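
For readers who want to try the pattern without a C++ toolchain, here is a rough Python port, offered as a sketch rather than the answer's own code. It assumes Python's re module, uses non-capturing groups, and abbreviates the repeated digit classes with {3} and +. The sample text is the question's, assembled line by line so the backslash counts stay explicit.

```python
import re

# Same grammar as the std::regex above: a quote, then any number of
# plain characters or backslash escapes, then a closing quote.
C_STRING = re.compile(
    r'''"(?:[^\\"]|\\(?:[abfnrtv'"\\?]|[0-7]{3}|x[0-9a-fA-F]+))*"'''
)

SAMPLE = "\n".join([
    '"Woah. Look. A tab."',
    '"This ' + "\\" * 13 + '" is all one string"',
    r'''"This \"\"\"\" is\" also\"\\ \' one\"\\\" string."''',
    r'"These \\""are separate strings"',
    r'"The cat said,\"Yo.\""',
    '"',
    r'''\"Shouldn't it work on multiple lines?\" he asked rhetorically.''',
    r'\"Of course it should.\"',
    '"',
    '''"If you don't have exactly 8 matches, then you've failed."''',
])

# With no capturing groups, findall returns the full text of each match.
matches = C_STRING.findall(SAMPLE)
for m in matches:
    print(repr(m))
```

No look-behind is needed because every backslash is consumed together with the character it escapes, so a quotation mark is only ever seen on its own when it is genuinely unescaped.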

Comparing two regex expressions for efficiency [duplicate]

This question already has answers here:
Why is a character class faster than alternation?
(2 answers)
Using alternation or character class for single character matching?
(3 answers)
Closed 3 years ago.
Why is:
[\s\S]+?
Much more efficient than:
(?:.|\n)+?
What are the differences between the two in terms of how they work behind the scenes?
Note: this is with DOTALL turned off. Also, from https://www.regular-expressions.info/dot.html:
JavaScript and VBScript do not have an option to make the dot match line break characters. In those languages, you can use a character class such as [\s\S] to match any character. This character matches a character that is either a whitespace character (including line break characters), or a character that is not a whitespace character. Since all characters are either whitespace or non-whitespace, this character class matches any character.
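
A minimal sketch, assuming Python's re module purely for illustration, that checks the two patterns accept the same text and sets up a timeit harness to compare them. No timings are quoted here because they depend on the engine and the input. The usual explanation is that a character class is a single set-membership test per character, whereas a backtracking engine treats each alternation branch as a separate path to try and possibly revisit. The input text and the surrounding <b> tags are made up.

```python
import re
import timeit

text = "<b>" + "line\n" * 200 + "</b>"  # hypothetical multi-line input

char_class = re.compile(r"<b>([\s\S]+?)</b>")
alternation = re.compile(r"<b>((?:.|\n)+?)</b>")

# Both patterns capture exactly the same span.
assert char_class.search(text).group(1) == alternation.search(text).group(1)

# Time each pattern on the same input; results vary by machine and engine.
for name, pattern in [("class", char_class), ("alternation", alternation)]:
    seconds = timeit.timeit(lambda: pattern.search(text), number=200)
    print(f"{name}: {seconds:.4f}s")
```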

Why do regexes and string literals use different escape sequences?

The handling of escape sequences varies across languages and between string literals and regular expressions. For example, in Python the \s escape sequence can be used in regular expressions but not in string literals, whereas in PHP the \f form feed escape sequence can be used in regular expressions but not in string literals.
In PHP, there is a dedicated page for PCRE escape sequences (http://php.net/manual/en/regexp.reference.escape.php) but it does not have an official list of escape sequences that are exclusive to string literals.
As a beginner in programming, I am concerned that I may not have a full understanding of the background and context of this topic. Are these concerns valid? Is this an issue that others are aware of?
Why do different programming languages handle escape sequences differently between regular expressions and string literals?

The escape sequences found in string literals are there to stop the programming language from getting confused. For example, in many languages a string literal is denoted as characters between quotes, like so:
my_string = 'x string'
But if your string contains a quote character, then you need a way to tell the programming language that it should be interpreted as a literal character:
my_string = 'x's string' # syntax error: the second quote ends the string early
my_string = 'x\'s string' # lets the programming language know that the internal quote is literal and not the end of the string
I think that most programming languages have a broadly similar set of escape sequences for string literals.
Regexes are a different story. You can think of them as their own separate language that is written as a string literal. In a regex, some characters, like the period (.), have a special meaning and must be escaped to match their literal counterpart, whereas other characters gain a special meaning when preceded by a backslash.
For example
regex_string = 'A.C' # match an A, followed by any character, followed by C
regex_string = 'A\.C' # match an A, followed by a period, followed by C
regex_string = 'AsC' # match an A, followed by s, followed by C
regex_string = 'A\sC' # match an A, followed by any whitespace character, followed by C
Because regexes are their own mini-language it doesn't make sense that all of the escape sequences in regexes are available to normal string literals.
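
The four patterns above can be checked directly. This is a small sketch assuming Python's re module; the helper name `matches` is made up for the example, and raw strings (r'...') are used so that the backslashes reach the regex engine unchanged.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Return True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

print(matches(r'A.C', 'AxC'))    # "." matches any character
print(matches(r'A\.C', 'A.C'))   # "\." matches only a period
print(matches(r'A\.C', 'AxC'))   # so "x" is rejected here
print(matches(r'AsC', 'AsC'))    # "s" is just the letter s
print(matches(r'A\sC', 'A\tC'))  # "\s" matches whitespace such as a tab
print(matches(r'A\sC', 'AsC'))   # but not the letter s
```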

Regular expressions are best thought of as a language in themselves, which have their own syntax. Some programming languages offer a literal syntax specifically for describing a regex, but usually a regex will be compiled from an existing string. If you create that string from literal syntax, that uses a different set of escape sequences because it is a different kind of thing, created with a different syntax, for a different context, in a different language. That's the simple and direct answer to the question.
There are different needs and requirements. Regexes have to be able to describe things that aren't a single, specific sequence of text. String literals obviously don't have that problem, but they do need a way to, say, include quotation marks in the text. That usually isn't a problem for regex syntax, because the content of the string is already determined by that point. (Some languages have a "regex literal" syntax, typically enclosing the regex in forward slashes. In these languages, forward slashes that are supposed to be part of the regex need to be escaped.)
Although I understand the obvious (\s represents multiple characters and would introduce ambiguity)
Ambiguity isn't actually a concern for most languages that support regex. It often happens that the string literal syntax and the regex syntax use the same sequence to mean different things. For example: \b represents a word boundary in regex syntax, but many languages' string literal syntax also uses it to represent a backspace character, Unicode code point 8. (Unless you meant that \s to mean "any whitespace character" doesn't make sense in the string literal context but only in the regex context - then yes, of course.)
But keep in mind - if the regex is being compiled from a string literal, then first the string literal is interpreted to figure out what the string actually contains, and then that string is used to create the regex. These are separate steps that can and do apply separate rules, so there is no conflict.
This sometimes means that code has to use a double escaping mechanism: first for the string literal, and then for the regex syntax. If you want a regex that matches a literal backslash, you might end up typing four backslashes in a string literal - since that code will create a string that actually contains only two backslashes, which in turn is what the regex syntax requires. (Some languages offer some kind of "raw" string literal facility to work around this.)
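
The two-step interpretation described above can be observed directly. The following is a minimal sketch assuming Python, whose r"..." raw literals are one example of the "raw" facility mentioned; the variable names are made up.

```python
import re

plain = "\\\\"   # ordinary literal: four backslashes typed
raw = r"\\"      # raw literal: two backslashes typed

# Step 1: both literals produce the same two-character string.
assert plain == raw and len(plain) == 2

# Step 2: the regex engine reads those two characters as
# "one literal backslash".
assert re.fullmatch(plain, "\\") is not None

# "\b" means different things at the two levels.
assert "\b" == "\x08"                              # string literal: backspace
assert re.search(r"\bcat\b", "a cat sat")          # regex: word boundary
assert re.search("\bcat\b", "a cat sat") is None   # backspaces are not in the text
```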

Period in .Net 3.5 Regex.IsMatch

I came across this regular expression in vb.net 3.5 code:
Regex.IsMatch(strString, "^[\w\s.+'\-\(\)\/\,\&\#]+$")
What is really confusing me is the ".+" part. I was under the impression that the period means any character and the plus sign means one or more. Following this, I feel like this regular expression should allow anything! But it doesn't, so I must be misunderstanding something. In testing it, it seems like the period and the plus sign are being taken as literals.
Could somebody help explain this to me?
Thanks!

The issue is that all of those characters are enclosed in a [character-group]. The escaping rules are different in character-groups than they are elsewhere in a RegEx expression. For instance, according to the MSDN documentation, \b inside a character-group means a backspace character whereas, outside of a character-group, it is an anchor that matches a word boundary.
According to the Regular-Expressions.info documentation:
In most regex flavors, the only special characters or metacharacters inside a character class are the closing bracket (]), the backslash (\), the caret (^), and the hyphen (-). The usual metacharacters are normal characters inside a character class, and do not need to be escaped by a backslash.
Therefore, in your example RegEx expression, it looks for any one of the characters in that bracketed list, including either the literal . or + character. If you think about it, it wouldn't make any sense to use a . to mean "any character" inside of a character-group. Doing so would make the group, itself, moot. And certainly, using the + character to mean "one or more times" inside of a character-group really makes no sense.

.+ means any character, one or more times. Maybe you need to escape the dot, like \.+?

Within square brackets, the dot and the plus don't have their special meaning. The square brackets define a "character class". It does not contain a string but a set of characters allowed at this position.
So the expression [\w\s.+'\-\(\)\/\,\&\#] creates a character class of letters, digits, underscores, whitespace, dots, pluses, single quotes, minuses, opening round brackets, closing round brackets, slashes, commas, ampersands and hash marks.
The + after the closing square bracket means you expect one or more characters from this character class.
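
To see this behavior in practice, here is a small sketch that runs the question's pattern through Python's re module. Python is an assumption made for illustration (the question uses .NET), and the test strings are made up, but the character-class rules shown are the same in both.

```python
import re

# The pattern from the question, written as a Python raw string.
allowed = re.compile(r"^[\w\s.+'\-\(\)\/\,\&\#]+$")

assert allowed.search("a.b+c")                   # "." and "+" are literal members
assert allowed.search("Smith & Sons (UK), #4")   # every character is in the class
assert allowed.search("a*b") is None             # "*" is not in the class
assert allowed.search("user@host") is None       # neither is "@"

# Outside a class the same two characters are metacharacters again.
assert re.fullmatch(r".+", "user@host")

# Inside a class, \b is a backspace rather than a word boundary.
assert re.fullmatch(r"[\b]", "\x08")
```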

Regex to allow just letters and special characters [duplicate]

This question already has answers here:
Regex only allow letters and some characters
(4 answers)
Closed 9 years ago.
I'm currently using the following regex to allow only letters:
"^[a-zA-Z]+$"
I would like to change it so that it also allows special characters like '-', as well as letters found in non-English alphabets.
How can I do it?

If you need to allow specific special characters, simply include them in the character class:
"^[a-zA-Z\-]+$"
Some special characters need to be escaped, some don't.
But if you want to accept every character except numeric characters, it might be simpler to simply use:
"^\D+$"
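
The difference between the two suggestions, and one way to cover letters from non-English alphabets, can be sketched as follows. This assumes Python 3's re module, where \w is Unicode-aware for str patterns; the [^\W\d_] idiom and the sample names are additions for illustration, not part of the answer above.

```python
import re

# ASCII letters plus a literal hyphen, as in the answer above.
ascii_letters = re.compile(r"^[a-zA-Z\-]+$")

# [^\W\d_] means "a word character that is neither a digit nor an
# underscore", which in Python 3 roughly means a letter in any alphabet.
any_letters = re.compile(r"^(?:[^\W\d_]|-)+$")

# \D is simply "not a digit", so punctuation and spaces pass too.
non_digits = re.compile(r"^\D+$")

for name in ["Jean-Luc", "\u00c9milie", "abc123", "a_b"]:
    print(name,
          bool(ascii_letters.search(name)),
          bool(any_letters.search(name)),
          bool(non_digits.search(name)))
```

Which of the three is appropriate depends on how strict the validation needs to be; \D is the most permissive.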

Hmm, try this regex: "\D". It matches any character that is not a digit, so it is equivalent to [^\d].
To allow only letters plus chosen special characters, list them in a character class instead. For example: "^[a-zA-Z\-+#$]+$"
