This may be a theoretical question.
Why does the underscore _ fall under \w in regex and not under \W?
I hope this isn't primarily opinion based, because there should be a reason.
Citation would be great, if at all available.
From Wikipedia's Regular expression article (emphasis mine):
An additional non-POSIX class understood by some tools is [:word:], which is usually defined as [:alnum:] plus underscore. This reflects the fact that in many programming languages these are the characters that may be used in identifiers.
In perl, tcl and vim, this non-standard class is represented by \w (and characters outside this class are represented by \W).
\w matches any single code point that has any of the following properties:
\p{Alphabetic} (letters and some more Unicode code points)
\p{GC=Mark} (Mark: spacing, non-spacing, enclosing)
\p{GC=Connector_Punctuation} (e.g. underscore)
\p{GC=Decimal_Number} (numbers and other variants of numbers)
\p{Join_Control} (code points U+200C and U+200D)
These properties are used in the composition of programming language identifiers in scripts. For instance[1]:
The Connector Punctuation (\p{GC=Connector_Punctuation}) is added in for programming language identifiers, thus adding "_" and similar characters.
There is a[2]:
general intent that an identifier consists of a string of characters beginning with a letter or an ideograph, and followed by any number of letters, ideographs, digits, or underscores.
The \p{Join_Control} was actually recently added to the character class \w as well and here's a message that perl devs exchanged for its implementation, supporting my earlier mention that \w is used to compose identifiers.
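As a quick check of the identifier connection, here is a minimal Python sketch. It assumes Python's built-in re module, whose \w is close to but not identical to the Perl/Unicode definition listed above. The names IDENTIFIER and is_identifier are made up for illustration.

```python
import re

# The underscore is a word character, so \w matches it and \W does not.
underscore_is_word = re.fullmatch(r"\w", "_") is not None
underscore_is_nonword = re.fullmatch(r"\W", "_") is not None

# Hypothetical identifier pattern: one word character that is not a digit,
# followed by any number of word characters.
IDENTIFIER = re.compile(r"[^\W\d]\w*")


def is_identifier(text: str) -> bool:
    """Return True if the whole string looks like an identifier."""
    return IDENTIFIER.fullmatch(text) is not None
```

With this sketch, is_identifier("snake_case_1") is accepted while is_identifier("1abc") is rejected, which mirrors the "letter first, then letters, digits, or underscores" intent quoted above.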
I'm reading Jan Goyvaerts' "Regular Expressions: The Complete Tutorial and Reference" to touch up on my Regex.
In the second chapter, Jan has a section on "special characters:"
Special Characters
Because we want to do more than simply search for literal pieces of text, we need to reserve certain characters for special use. In the regex flavors discussed in this tutorial, there are 12 characters with special meanings: the backslash \, the caret ^, the dollar sign $, the period or dot ., the vertical bar or pipe symbol |, the question mark ?, the asterisk or star *, the plus sign +, the opening parenthesis (, the closing parenthesis ), the opening square bracket [, and the opening curly brace {, These special characters are often called “metacharacters”. Most of them are errors when used alone.
(emphasis mine)
I understand that only the opening square bracket and opening curly brace are special, since a closing brace or bracket is clearly a literal if there's no preceding opener. However, why does Jan specify that the closing parenthesis is a special character if the other two closing characters aren't?
Short answer
The regex flavors in my book do not require } and ] to be escaped (except for ] in character classes in JavaScript). So I don't because I like to have as few backslashes in my regexes as possible. You can escape them if you find your regexes clearer that way.
Full answer
First of all, anyone learning about regular expressions needs to understand the importance of the qualifier "In the regex flavors discussed in this tutorial..." You cannot discuss regular expressions without stating which regex flavor(s) you're talking about.
What I wrote is true for the flavors my book (2006 edition) discusses. In those flavors, ) is treated as a token that closes a group. It is a syntax error if used without a corresponding (. So ) has a special meaning when used all on its own.
} does not have a special meaning when used all on its own. You never need to escape it with these flavors. If you wanted to match something like {7} or {7,42} literally, you only need to escape the opening {. If you want to argue that } is special because it sometimes has a special meaning, then you would have to say the same about , which becomes special in the same situation.
] does not have a special meaning outside character classes in these regex flavors. You never need to escape it outside character classes. The paragraph you quoted does not talk about special characters inside character classes. That's a totally different list (\, ], ^, and -) discussed in a later chapter.
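To see the difference between ) and the other two closers in practice, here is a small sketch using Python's re module. Python is only one flavor, so treat this as an illustration rather than a statement about every flavor the book covers. The helper name compiles is an assumption for this example.

```python
import re


def compiles(pattern: str) -> bool:
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# A lone ) is a syntax error, while a lone } or ] is an ordinary literal.
lone_paren_ok = compiles(")")
lone_brace_ok = compiles("}")
lone_bracket_ok = compiles("]")

# Escaping only the opening { is enough to match {7} or {7,42} literally.
literal_braces = re.fullmatch(r"\{7}", "{7}") is not None
literal_range = re.fullmatch(r"\{7,42}", "{7,42}") is not None
```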
Now as to why: most regular expressions have plenty of backslashes already. My preferred style is to escape as few characters as needed. So I never escape }. I escape ] in character classes when using JavaScript because that's the only way. But with other flavors I place ] at the start of the character class or after the negating caret so I don't need to escape it. My teaching materials teach this style. When my products RegexBuddy or RegexMagic convert or generate regular expressions, they also use as few backslashes as needed.
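The "place ] first" style described above can be sketched as follows. This assumes Python's re module, which accepts a leading ] in a character class as a literal. The variable names are illustrative only.

```python
import re

# A ] placed first in a character class (or right after the negating ^)
# is taken literally, so no backslash is needed.
bracket_or_a = re.compile(r"[]a]")
not_bracket_or_a = re.compile(r"[^]a]")

# The escaped spelling is equivalent, and is the form JavaScript requires.
escaped_form = re.compile(r"[\]a]")
```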
I often see people new to regular expressions needlessly escape characters like ", ', or / because they need to be escaped when the regular expression is quoted as a source code literal in certain programming languages. But the regular expression itself does not require these to be escaped.
I even see people escape characters like < or >. This is a bad habit because in some regex flavors \< and \> are word boundaries. This includes recent versions of PCRE (but not the PCRE that was current in 2006).
But, if you find it confusing to see unescaped } and ] used as literals, you are free to escape them in your regexes. Except for < and >, all the flavors discussed in my book allow you to escape any punctuation character to match that character literally, even if the character on its own would be a literal already.
So somebody saying that } and ] are special characters in regular expressions is not wrong if "special characters" means "characters that have a special meaning either on their own or when used in combination with other characters". But that list would also include , (quantifier), : (non-capturing group), - (mode modifier), ! (negative lookaround), < (lookbehind), and - (character class range).
But if "special characters" means "characters that have a special meaning on their own", then } and ] are not included in the list for the flavors my book covers.
The following paragraphs give an answer. I'm citing from Jan's website, not from the book, though:
If you forget to escape a special character where its use is not
allowed, such as in +1, then you will get an error message.
Most regular expression flavors treat the brace { as a literal
character, unless it is part of a repetition operator like a{1,3}.
So you generally do not need to escape it with a backslash, though you
can do so if you want. But there are a few exceptions.
Java requires literal opening braces to be escaped. Boost and std::regex require all literal braces to be escaped.
] is a literal outside character classes.
Different rules apply inside character classes. Those are discussed in
the topic about character classes. Again, there are exceptions.
std::regex and Ruby require closing square brackets to be escaped even outside character classes.
It seems like he uses "needs to be escaped" as his definition for "special character", and unlike ), the ] and } characters need not be escaped in most flavours.
That said, you wouldn't be wrong calling them special characters as well. It's definitely a best practice to always escape them, and in no flavour do \] and \} mean anything other than a literal ] or }.
On the other hand, they have their special meaning only inside a specific (parsing) context, namely when they follow [ and { respectively. There are similar cases: :=><!#'&, all have a non-literal meaning inside a specific context, and we wouldn't normally call these "special characters" either.
And while we could say the same about ), almost no flavour allows for it to occur on its own outside of groups, because pairs of parentheses always need to match. Its only usage is in the special context, and therefore ) is considered a special character.
Everywhere in a regular expression, regardless of the engine and its standards, a parenthesis should be escaped to mean a literal character. Even the closing parenthesis. However, this doesn't apply to POSIX regular expressions:
) The <right-parenthesis> shall be special when matched with a preceding <left-parenthesis>, both outside a bracket expression.
But the interesting part is that POSIX has a separate definition for a right-parenthesis for times it should be treated as a special character. It doesn't have it for } or ].
Why don't other engines follow this rule?
Call it implementation peculiarities or historical reasons that have something to do with Perl as commented in PCRE source code:
/* It appears that Perl allows any characters whatsoever, other than
a closing parenthesis, to appear in arguments, so we no longer insist on
letters, digits, and underscores. */
It seems that, with all the special clusters in more advanced engines, treating a closing parenthesis as a special character costs much less than implementing the POSIX rule.
From experiments, it appears that unlike ), the characters ] and } are only interpreted as delimiters when the corresponding opening [ or { has been met.
Though IMO the same rule could apply to ), that's the way it is.
This might be due to the way the parser was written: parentheses can be nested, so their balancing needs to be checked, whereas brackets and curly braces are just flagged. (For instance, [[] is a valid class definition. [[]] is also a valid pattern but is understood as [\[]\].)
At http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap09.html (which looks sort of like an official specification for Posix) it lists the character classes which must be supported in regular expressions, including e.g. [:space:].
But where are those character classes defined? Where can I find definitively which characters [:space:] should match? I'm looking for an actual standard, not a wiki-like-page-thing or somebody's blog. Thanks.
This set is locale dependent.
The POSIX one is detailed here:
http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap07.html
space
Define characters to be classified as white-space characters.
In the POSIX locale, exactly <space>, <form-feed>, <newline>, <carriage-return>, <tab>, and <vertical-tab> shall be included.
In a locale definition file, no character specified for the keywords upper, lower, alpha, digit, graph, or xdigit shall be specified. The <space>, <form-feed>, <newline>, <carriage-return>, <tab>, and <vertical-tab> of the portable character set, and any characters included in the class blank are automatically included in this class.
In addition to the previously mentioned characters, locales are free to add any number of horizontal or vertical "space" characters, such as "non breaking space", "fixed width space" and similar.
To know if a given character is part of this class or not in the current locale, you use the isspace function.
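The isspace function above is the C library one. As a rough analogue, here is a Python sketch that checks the six characters required in the POSIX locale. Note the assumption: Python's str.isspace() follows Unicode rather than the current C locale, so it is not a drop-in replacement for the locale-dependent C function.

```python
import re
import string

# The six characters POSIX requires in the "space" class of the POSIX locale:
# space, form-feed, newline, carriage-return, tab, vertical-tab.
POSIX_SPACE = " \f\n\r\t\v"

# str.isspace() is Unicode-based, so it also accepts characters that a
# locale may add, such as NO-BREAK SPACE (U+00A0).
nbsp_is_space_str = "\u00a0".isspace()

# bytes.isspace() only knows the ASCII whitespace characters.
nbsp_is_space_bytes = b"\xa0".isspace()
```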
How to rewrite the [a-zA-Z0-9!$* \t\r\n] pattern to match hyphen along with the existing characters ?
The hyphen is usually a normal character in regular expressions. Only if it’s in a character class and between two other characters does it take a special meaning.
Thus:
[-] matches a hyphen.
[abc-] matches a, b, c or a hyphen.
[-abc] matches a, b, c or a hyphen.
[ab-d] matches a, b, c or d (only here the hyphen denotes a character range).
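The four cases above, plus the pattern from the question with a hyphen added, can be checked with a short Python sketch. The question does not name its regex flavor, so Python's re module is an assumption here. ALLOWED and matches_one are illustrative names.

```python
import re

# The character class from the question, with a literal hyphen added last
# so that it cannot be read as part of a range.
ALLOWED = re.compile(r"[a-zA-Z0-9!$* \t\r\n-]")


def matches_one(pattern: str, ch: str) -> bool:
    """Return True if the pattern matches exactly the single character ch."""
    return re.fullmatch(pattern, ch) is not None
```

For example, matches_one("[ab-d]", "c") succeeds because the hyphen denotes a range there, while matches_one("[ab-d]", "-") fails.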
Escape the hyphen.
[a-zA-Z0-9!$* \t\r\n\-]
UPDATE:
Never mind this answer - you can add the hyphen to the group but you don't have to escape it. See Konrad Rudolph's answer instead which does a much better job of answering and explains why.
It’s less confusing to always use an escaped hyphen, so that it doesn't have to be positionally dependent. That’s a \- inside the bracketed character class.
But there’s something else to consider. Some of those enumerated characters should possibly be written differently. In some circumstances, they definitely should.
This comparison of regex flavors says that C♯ can use some of the simpler Unicode properties. If you’re dealing with Unicode, you should probably use the general category \p{L} for all possible letters, and maybe \p{Nd} for decimal numbers. Also, if you want to accommodate all that dash punctuation, not just HYPHEN-MINUS, you should use the \p{Pd} property. You might also want to write that sequence of whitespace characters simply as \s, assuming that’s not too general for you.
All together, that works out to a pattern of [\p{L}\p{Nd}\p{Pd}!$*] to match any one character from that set.
I’d likely use that anyway, even if I didn’t plan on dealing with the full Unicode set, because it’s a good habit to get into, and because these things often grow beyond their original parameters. Now when you lift it to use in other code, it will still work correctly. If you hard‐code all the characters, it won’t.
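To illustrate what those general categories cover, here is a sketch in Python. Python's built-in re module does not support \p{...}, so this emulates the character class with the standard unicodedata module instead of running the pattern itself. The function name in_extended_set is hypothetical.

```python
import unicodedata


def in_extended_set(ch: str) -> bool:
    """Emulate the Unicode-property class from the text for one character.

    Accepts any letter (L*), decimal digit (Nd), dash punctuation (Pd),
    or one of the literal characters !, $, *.
    """
    category = unicodedata.category(ch)
    return category.startswith("L") or category in ("Nd", "Pd") or ch in "!$*"
```

Called on single characters, this accepts "é", the EN DASH U+2013 and the ARABIC-INDIC DIGIT THREE U+0663, none of which the hard-coded ASCII class would match.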
[-a-z0-9]+, [a-z0-9-]+ and [a-z-0-9]+ are all the same. A hyphen between two ranges is treated as a literal symbol. The regex [a-z0-9-+()]+ also allows a hyphen.
Use "\p{Pd}" (without the quotes) to match any type of hyphen. The '-' character is just one type of hyphen, which also happens to be a special character in regex.
Is this what you are after?
MatchCollection matches = Regex.Matches(mystring, "-");
The handling of escape sequences varies across languages and between string literals and regular expressions. For example, in Python the \s escape sequence can be used in regular expressions but not in string literals, whereas in PHP the \f form feed escape sequence can be used in regular expressions but not in string literals.
In PHP, there is a dedicated page for PCRE escape sequences (http://php.net/manual/en/regexp.reference.escape.php) but it does not have an official list of escape sequences that are exclusive to string literals.
As a beginner in programming, I am concerned that I may not have a full understanding of the background and context of this topic. Are these concerns valid? Is this an issue that others are aware of?
Why do different programming languages handle escape sequences differently between regular expressions and string literals?
The escape sequences found in string literals are there to stop the programming language from getting confused. For example, in many languages a string literal is denoted as characters between quotes, like so
my_string = 'x string'
But if your string contains a quote character then you need a way to tell the programming language that this should be interpreted as a literal character
my_string = 'x's string' # this will cause bugs
my_string = 'x\'s string' # lets the programming language know that the internal quote is literal and not the end of the string
I think that most programming languages have the same set of escape sequences for string literals.
Regexes are a different story. You can think of them as their own separate language that is written as a string literal. In a regex, some characters like the period (.) have a special meaning and must be escaped to match their literal counterpart, whereas other characters take on a special meaning only when preceded by a backslash.
For example
regex_string = 'A.C' # match an A, followed by any character, followed by C
regex_string = 'A\.C' # match an A, followed by a period, followed by C
regex_string = 'AsC' # match an A, followed by s, followed by C
regex_string = 'A\sC' # match an A, followed by a whitespace character, followed by C
Because regexes are their own mini-language it doesn't make sense that all of the escape sequences in regexes are available to normal string literals.
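The four patterns above can be tried out in Python. This is a small sketch that assumes Python's re module and uses raw strings so the backslashes reach the regex engine unchanged. The helper name full is made up for this example.

```python
import re


def full(pattern: str, text: str) -> bool:
    """Return True if the pattern matches the entire text."""
    return re.fullmatch(pattern, text) is not None


# Unescaped dot: any character (except a newline by default).
dot_matches_any = full(r"A.C", "AxC")

# Escaped dot: only a literal period.
escaped_dot_is_literal = full(r"A\.C", "A.C") and not full(r"A\.C", "AxC")

# A plain s is just the letter s.
plain_s_is_literal = full(r"AsC", "AsC") and not full(r"AsC", "A C")

# Backslash-s: any whitespace character, not only the space.
backslash_s_is_whitespace = full(r"A\sC", "A C") and full(r"A\sC", "A\tC")
```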
Regular expressions are best thought of as a language in themselves, which have their own syntax. Some programming languages offer a literal syntax specifically for describing a regex, but usually a regex will be compiled from an existing string. If you create that string from literal syntax, that uses a different set of escape sequences because it is a different kind of thing, created with a different syntax, for a different context, in a different language. That's the simple and direct answer to the question.
There are different needs and requirements. Regexes have to be able to describe things that aren't a single, specific sequence of text. String literals obviously don't have that problem, but they do need a way to, say, include quotation marks in the text. That usually isn't a problem for regex syntax, because the content of the string is already determined by that point. (Some languages have a "regex literal" syntax, typically enclosing the regex in forward slashes. In these languages, forward slashes that are supposed to be part of the regex need to be escaped.)
Although I understand the obvious (\s represents multiple characters and would introduce ambiguity)
Ambiguity isn't actually a concern for most languages that support regex. It often happens that the string literal syntax and the regex syntax use the same sequence to mean different things. For example: \b represents a word boundary in regex syntax, but many languages' string literal syntax also uses it to represent a backspace character, Unicode code point 8. (Unless you meant that \s to mean "any whitespace character" doesn't make sense in the string literal context but only in the regex context - then yes, of course.)
But keep in mind - if the regex is being compiled from a string literal, then first the string literal is interpreted to figure out what the string actually contains, and then that string is used to create the regex. These are separate steps that can and do apply separate rules, so there is no conflict.
This sometimes means that code has to use a double escaping mechanism: first for the string literal, and then for the regex syntax. If you want a regex that matches a literal backslash, you might end up typing four backslashes in a string literal - since that code will create a string that actually contains only two backslashes, which in turn is what the regex syntax requires. (Some languages offer some kind of "raw" string literal facility to work around this.)
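The two-step interpretation can be made concrete with a short Python sketch. Python is just one example language here. Its raw string prefix r"..." is the kind of facility mentioned above.

```python
import re

# In an ordinary string literal, four typed backslashes produce a
# two-character string, which the regex engine reads as one literal backslash.
double_escaped = "\\\\"
raw_form = r"\\"  # the same two-character string, written as a raw literal

# "\b" in a string literal is the backspace character (code point 8),
# while the two-character sequence \b is a word boundary to the regex engine.
backspace = "\b"
word_boundary = re.compile(r"\bcat\b")
```

Here len(double_escaped) is 2, and the compiled word_boundary pattern finds "cat" in "a cat sat" but not inside "concatenate".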
English, of course, is a no-brainer for regex because that's what it was originally developed in/for:
Can regular expressions understand this character set?
French gets into some accented characters which I'm unsure how to match against - i.e. are è and e both considered word characters by regex?
Les expressions régulières peuvent comprendre ce jeu de caractères?
Japanese doesn't contain what I know as regex word characters to match against.
正規表現は、この文字を理解でき、設定?
Short answer: yes.
More specifically it depends on your regex engine supporting unicode matches (as described here).
Such matches can complicate your regular expressions enormously, so I can recommend reading this unicode regex tutorial (also note that unicode implementations themselves can be quite a mess so you might also benefit from reading Joel Spolsky's article about the inner workings of character sets).
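As one concrete data point, here is how Python 3's re module treats words from the French and Japanese sentences in the question. Other engines may behave differently, so this is only a sketch of one Unicode-aware implementation. The variable names are illustrative.

```python
import re

french = "r\u00e9guli\u00e8res"          # "régulières", precomposed accents
japanese = "\u6b63\u898f\u8868\u73fe"    # "正規表現"

# For str patterns, Python 3 matches \w against Unicode word characters.
unicode_word = re.compile(r"\w+")

# The re.ASCII flag restricts \w to [a-zA-Z0-9_].
ascii_word = re.compile(r"\w+", re.ASCII)
```

In this setup both words are matched in full by unicode_word, while ascii_word matches neither of them completely.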
"[\p{L}]"
This regular expression contains all characters that are letters, from all languages, upper and lower case.
So letters like (a-z A-Z ä ß è 正 の文字を理解) are accepted, but signs like (, . ? > :) or other similar ones are not.
The brackets [] mean that this expression is a set.
If you want an unlimited number of letters from this set to be accepted, use an asterisk * after the brackets, like this: "[\p{L}]*"
It is always important to make sure you take care of white space in your regex, since your evaluation might fail because of white space. To solve this you can use: "[\p{L} ]*" (notice the white space inside the brackets)
If you want to include numbers as well, "[\p{L}\p{N} ]*" can help. \p{N} matches any kind of numeric character in any script.
As far as I know, there isn't any specific pattern like [a-zA-Z] that you can use to match "è", but you can always match such characters separately, i.e. [a-zA-Zè正]
Obviously that can make your regexp immense, but you can always control this by adding your strings into variables, and only passing the variables into the expressions.
Generally speaking, regex is more for grokking machine-readable text than for human-readable text. It is in many ways a more general answer to the whole XML with regex thing; regex is by its very nature incapable of properly parsing human language, because the language is more complex than what you are using to parse it.
If you want to break down human language (English included), you would want to use a language analysis tool or even an AI, not mere regular expressions.
/[\p{Latin}]/ should, for example, include the Latin alphabet. You can get the full explanation and reference here.
It is not about the regular expression but about the framework that executes it. Java and .NET, I think, are very good at handling Unicode, so "è and e both considered word characters by regex" is true.
It depends on the implementation and the character set. In general the answer is "Yes," but it may require additional setup on your part.
In Perl, for example, the meaning of things like \w is altered by the chosen locale (use locale).
This SO thread might help. It includes the Unicode character classes you can use in a regex (e.g., the Ll category is all lowercase letters, regardless of language).