Period in .Net 3.5 Regex.IsMatch - regex

I came across this regular expression in vb.net 3.5 code:
Regex.IsMatch(strString, "^[\w\s.+'\-\(\)\/\,\&\#]+$")
What is really confusing me is the ".+" part. I was under the impression that the period means any character and the plus sign means one or more. Following this, I feel like this regular expression should allow anything! But it doesn't, so I must be misunderstanding something. In testing it, it seems like the period and the plus sign are being taken as literals.
Could somebody help explain this to me?
Thanks!

The issue is that all of those characters are enclosed in a [character-group]. The escaping rules are different in character-groups than they are elsewhere in a RegEx expression. For instance, according to the MSDN documentation, \b inside a character-group means a backspace character whereas, outside of a character-group, it is an anchor that matches a word boundary.
According to the Regular-Expressions.info documentation:
In most regex flavors, the only special characters or metacharacters inside a character class are the closing bracket (]), the backslash (\), the caret (^), and the hyphen (-). The usual metacharacters are normal characters inside a character class, and do not need to be escaped by a backslash.
Therefore, in your example RegEx expression, it looks for any one of the characters in that bracketed list, including either the literal . or + character. If you think about it, it wouldn't make any sense to use a . to mean "any character" inside of a character-group. Doing so would make the group, itself, moot. And certainly, using the + character to mean "one or more times" inside of a character-group really makes no sense.
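A quick way to check this is to run the pattern from the question against a few strings. The sketch below uses Python's re module rather than .NET (an assumption made for illustration; for the constructs used here the two flavors follow the same character-class rules). The helper name is_allowed and the sample strings are made up.

```python
import re

# The pattern from the question: one or more characters from the class.
PATTERN = re.compile(r"^[\w\s.+'\-\(\)\/\,\&\#]+$")

def is_allowed(text):
    """Return True if every character in text belongs to the class."""
    return PATTERN.match(text) is not None

print(is_allowed("Smith & Co. +1 (555) #42"))  # only listed characters
print(is_allowed("a.b+c"))                      # . and + are literals here
print(is_allowed("a*b"))                        # * is not in the class
print(is_allowed("user@example.com"))           # @ is not in the class

# Inside a class, \b is the backspace character (U+0008), not a word boundary.
print(re.fullmatch(r"[\b]", "\x08") is not None)
```

If the dot inside the brackets meant "any character", the last two is_allowed calls would succeed; they do not.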

.+ means any character, repeated one or more times. Maybe you need to escape the dot, as in \.+?

Within the square brackets, dot and plus don't have their special meaning. The square brackets define a "character class". It does not contain a string but a set of characters allowed at this position.
So the expression [\w\s.+'\-\(\)\/\,\&\#] creates a character class of letters, digits, underscore, whitespace, dots, pluses, single quotes, minuses, opening round brackets, closing round brackets, slashes, commas, ampersands and hashmarks.
The + behind the square brackets means you expect one or more characters of this character class.

Related

Why is only ) a special character and not } or ]?

I'm reading Jan Goyvaerts' "Regular Expressions: The Complete Tutorial and Reference" to touch up on my Regex.
In the second chapter, Jan has a section on "special characters:"
Special Characters
Because we want to do more than simply search for literal pieces of text, we need to reserve certain characters for special use. In the regex flavors discussed in this tutorial, there are 12 characters with special meanings: the backslash \, the caret ^, the dollar sign $, the period or dot ., the vertical bar or pipe symbol |, the question mark ?, the asterisk or star *, the plus sign +, the opening parenthesis (, the closing parenthesis ), the opening square bracket [, and the opening curly brace {. These special characters are often called “metacharacters”. Most of them are errors when used alone.
(emphasis mine)
I understand that only the opening square bracket and opening curly brace are special, since a closing brace or bracket is clearly a literal if there's no preceding opener. However, why does Jan specify that the closing parenthesis is a special character if the other two closing characters aren't?
Short answer
The regex flavors in my book do not require } and ] to be escaped (except for ] in character classes in JavaScript). So I don't because I like to have as few backslashes in my regexes as possible. You can escape them if you find your regexes clearer that way.
Full answer
First of all, anyone learning about regular expressions needs to understand the importance of the qualifier "In the regex flavors discussed in this tutorial..." You cannot discuss regular expressions without stating which regex flavor(s) you're talking about.
What I wrote is true for the flavors my book (2006 edition) discusses. In those flavors, ) is treated as a token that closes a group. It is a syntax error if used without a corresponding (. So ) has a special meaning when used all on its own.
} does not have a special meaning when used all on its own. You never need to escape it with these flavors. If you wanted to match something like {7} or {7,42} literally, you only need to escape the opening {. If you want to argue that } is special because it sometimes has a special meaning, then you would have to say the same about , which becomes special in the same situation.
] does not have a special meaning outside character classes in these regex flavors. You never need to escape it outside character classes. The paragraph you quoted does not talk about special characters inside character classes. That's a totally different list (\, ], ^, and -) discussed in a later chapter.
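As a small illustration of the three cases above, here is a sketch using Python's re module. Choosing Python is an assumption on my part for the sake of a runnable example; other flavors can differ, and the variable names are invented.

```python
import re

# A lone } or ] outside a character class is an ordinary literal.
brace = re.fullmatch(r"\{7,42}", "{7,42}")   # only the opening { is escaped
bracket = re.fullmatch(r"a]", "a]")          # ] needs no escape here

# A lone ) is a syntax error because it closes a group that was never opened.
try:
    re.compile(r"a)")
    paren_error = None
except re.error as exc:
    paren_error = str(exc)

print(brace is not None, bracket is not None, paren_error)
```

The first two patterns compile and match, while the third is rejected at compile time.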
Now as to why: most regular expressions have plenty of backslashes already. My preferred style is to escape as few characters as needed. So I never escape }. I escape ] in character classes when using JavaScript because that's the only way. But with other flavors I place ] at the start of the character class or after the negating caret so I don't need to escape it. My teaching materials teach this style. When my products RegexBuddy or RegexMagic convert or generate regular expressions, they also use as few backslashes as needed.
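The "] at the start of the character class" style can be sketched as follows, again using Python's re module as an assumed example flavor (it does not work in JavaScript, as noted above). The variable names are illustrative only.

```python
import re

# ] placed first in the class is taken as a literal member, not as the closer.
closing_or_a = re.compile(r"[]a]+")
# The same works right after the negating caret.
not_closing_or_a = re.compile(r"[^]a]+")

print(closing_or_a.fullmatch("a]]a") is not None)      # True
print(not_closing_or_a.fullmatch("xyz") is not None)   # True
print(not_closing_or_a.fullmatch("x]z") is not None)   # False
```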
I often see people new to regular expressions needlessly escape characters like ", ', or / because they need to be escaped when the regular expression is quoted as a source code literal in certain programming languages. But the regular expression itself does not require these to be escaped.
I even see people escape characters like < or >. This is a bad habit because in some regex flavors \< and \> are word boundaries. This includes recent versions of PCRE (but not the PCRE that was current in 2006).
But, if you find it confusing to see unescaped } and ] used as literals, you are free to escape them in your regexes. Except for < and >, all the flavors discussed in my book allow you to escape any punctuation character to match that character literally, even if the character on its own would be a literal already.
So somebody saying that } and ] are special characters in regular expressions is not wrong if "special characters" means "characters that have a special meaning either on their own or when used in combination with other characters". But that list would also include , (quantifier), : (non-capturing group), - (mode modifier), ! (negative lookaround), < (lookbehind), and - (character class range).
But if "special characters" means "characters that have a special meaning on their own", then } and ] are not included in the list for the flavors my book covers.
The following paragraphs give an answer. I'm citing from Jan's website, not from the book, though:
If you forget to escape a special character where its use is not allowed, such as in +1, then you will get an error message.
Most regular expression flavors treat the brace { as a literal character, unless it is part of a repetition operator like a{1,3}. So you generally do not need to escape it with a backslash, though you can do so if you want. But there are a few exceptions. Java requires literal opening braces to be escaped. Boost and std::regex require all literal braces to be escaped.
] is a literal outside character classes. Different rules apply inside character classes. Those are discussed in the topic about character classes. Again, there are exceptions. std::regex and Ruby require closing square brackets to be escaped even outside character classes.
It seems like he uses "needs to be escaped" as his definition for "special character", and unlike ), the ] and } characters need not be escaped in most flavours.
That said, you wouldn't be wrong calling them special characters as well. It's definitely a best practice to always escape them, and in no flavour do \] and \} mean anything other than a literal ] or }.
On the other hand, they have their special meaning only inside a specific (parsing) context, namely when they follow [ and { respectively. There are similar cases: :=><!#'&, all have a non-literal meaning inside a specific context, and we wouldn't normally call these "special characters" either.
And while we could say the same about ), almost no flavour allows for it to occur on its own outside of groups, because pairs of parentheses always need to match. Its only usage is in the special context, and therefore ) is considered a special character.
Everywhere in a regular expression, regardless of the engine and its standards, a parenthesis should be escaped to mean a literal character, even the closing parenthesis. However, this doesn't apply to POSIX regular expressions:
) The <right-parenthesis> shall be special when matched with a preceding <left-parenthesis>, both outside a bracket expression.
But the interesting part is that POSIX has a separate definition for a right-parenthesis for times it should be treated as a special character. It doesn't have it for } or ].
Why don't other engines follow this rule?
Call it implementation peculiarities or historical reasons that have something to do with Perl as commented in PCRE source code:
/* It appears that Perl allows any characters whatsoever, other than
a closing parenthesis, to appear in arguments, so we no longer insist on
letters, digits, and underscores. */
It seems that, with all the special clusters in more advanced engines, treating a closing parenthesis as a special character costs much less than implementing the POSIX standard.
From experiments, it appears that unlike ), the characters ] and } are only interpreted as delimiters when the corresponding opening [ or { has been met.
Though IMO the same rule could apply to ), that's the way it is.
This might be due to the way the parser was written: parentheses can be nested, so the balancing needs to be checked, whereas brackets/curly braces are just flagged. (For instance, [[] is a valid class definition. [[]] is also a valid pattern but understood as [\[]\].)

Why do escape characters in regex mismatch?

If I want to match the dot symbol (.) I have to write this regex:
/\./
Escape character is needed to match the symbol itself.
If I want to match the 'd' symbol I have to write this one:
/d/
Escape character is not needed to match the symbol itself.
And if I want to match any character (/./) or any digit character (/\d/) it's vice versa.
It seems to me that this approach is not very consistent. What is the reasoning that stands behind it?
Thank you.
The . character is a reserved regular expression keyword. The d isn't. You need to include the escape character when you match a period to explicitly tell regex that you want to use the period as a normal matching character. d by itself isn't a reserved word, so you don't need to escape it, but \d is a reserved word.
I can see how, to someone new to regex, it can seem a little odd. But the . is used so often, and I can't think of a time I've really needed to match periods, so it just makes more sense to have it be one character without the backslash.
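The four cases from the question can be put side by side. This is a minimal sketch in Python's re module (an assumed flavor, chosen only so the example runs; the question itself uses /.../ literals), with a made-up sample string.

```python
import re

sample = "d.1"

any_char = re.findall(r".", sample)      # unescaped dot: any character
literal_dot = re.findall(r"\.", sample)  # escaped dot: the period itself
literal_d = re.findall(r"d", sample)     # plain d: the letter d
any_digit = re.findall(r"\d", sample)    # escaped d: any digit

print(any_char, literal_dot, literal_d, any_digit)
# ['d', '.', '1'] ['.'] ['d'] ['1']
```

In both directions the backslash switches a character between its literal reading and its special reading; which reading is the default depends on the character.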

"Match a literal character", or "match a character literally"?

I was making a RegEx using the regex101 tool and read in the explanation field
[.] - the literal character .
[\.] - matches the character . literally
I get lost between "literal character" and "character literally".
What is the difference between these two?
There is no difference. Sorry, I take that back. The only difference is the words that Firas Dib, the author of regex101, chose to explain various tokens.
A literal character or matching something literally refers to specifying an actual character in the text: for instance, a to match a, as opposed to a character class such as \w that could also match a.
You can match a literal period in any of these three ways:
\.
[.]
[\.]
Which Option is Better?
Some people like option 2 because it makes it clear you are matching a period, not the catch-all dot. It stands out. For myself, I use \.. Some people will say that using a character class is less optimal, but on modern processors it makes no difference. You pick.
Option 3 is over the top and is typically used when someone doesn't know that periods don't need to be escaped inside a character class. In my view it's confusing. What did the author mean? Were they trying to create a character class to match either a backslash or a period, and made a typo? (That would be [\\.].)
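To confirm that the three options behave the same, and that the two-backslash class is a different thing, here is a short sketch in Python's re module (an assumption; regex101 supports several flavors). The sample text is invented.

```python
import re

text = r"a.b\c"   # contains one period and one backslash

# The three spellings of "a literal period".
forms = [r"\.", r"[.]", r"[\.]"]
results = {form: re.findall(form, text) for form in forms}

# [\\.] is a class with two members: a backslash and a period.
backslash_or_dot = re.findall(r"[\\.]", text)

print(results)
print(backslash_or_dot)
```

All three entries in results hold just the period, while backslash_or_dot also picks up the backslash.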

What does (^?)* mean in this regex?

I have this regex:
^(^?)*\?(.*)$
If I understand correctly, this is the breakdown of what it does:
^ - start matching from the beginning of the string
(^?)* - I don't know, but it stores it in $1
\? - matches a question mark
(.*)$ - matches anything until the end of the string
So what does (^?)* mean?
The (^?) is simply looking for the literal character ^. The ^ character in a regex pattern only has special meaning when used as the first character of the pattern or the first character in a grouping match []. When used outside those 2 positions, the ^ is interpreted literally, meaning it looks for the ^ character in the input string.
Note: Whether or not ^ outside of the first and grouping position is interpreted literally is regex engine specific. I'm not familiar enough with Lua to state which it does.
Lua does not have a conventional regexp language, it has Lua patterns in its place. While they look a lot like regexp, Lua patterns are a distinct language of their own that has a simpler set of rules and most importantly lacks grouping and alternation features.
Interpreted as a Lua pattern, the example will surprise a longtime regexp user since so many details are different.
Lua patterns are described in PiL, and at a first glance are similar enough to a conventional regexp to cause confusion. The biggest differences are probably the lack of an alternation operator |, parentheses are only used to mark captures, quantifiers (?, -, +, and *) only apply to a character or character class, and % is the escape character, not \. A big clue that this example was probably not written with Lua in mind is the lack of the Lua pattern quoting character % applied to any (or ideally, all) of the non-alphanumeric characters in the pattern string, and the suspicious use of \? which smells like a conventional regexp to match a single literal ?.
The simple answer to the question asked is: (^?)* is not a recommended form, and would match ^* or *, capturing the presence or absence of the caret. If that were the intended effect, then I would write it as (%^?)%* to make that clearer.
To see why this is the case, let's take the pattern given and analyze it as a Lua pattern. The entire pattern is:
^(^?)*\?(.*)$
Handed to string.match(), it would be interpreted as follows:
^ anchors the match to the beginning of the string.
( marks the beginning of the first capture.
^ is not at the beginning of the pattern or a character class, so it matches a literal ^ character. For clarity that should likely have been written as %^.
? matches exactly zero or one of the previous character.
) marks the end of the first capture.
* is not after something that can be quantified so it matches a literal * character. For clarity that should likely have been written as %*.
\ in a pattern matches itself; it is not an escape character in the pattern language. However, it is an escape character in a Lua short string literal, making the following character not special to the string literal parser, which in this case is moot because the ? that follows was not special to it in any case. So if the pattern were enclosed in double or single quotes, then the \ would be absorbed by string parsing. If written in a long string (as [[^(^?)*\?(.*)$]]), the backslash would survive the string parser, to appear in the pattern.
? matches exactly zero or one of the previous character.
( marks the beginning of the second capture.
. matches any character at all, effectively a synonym for the class [\000-\255] (remember, in Lua numeric escapes are in decimal not octal as in C).
* matches zero or more of the previous character, greedily.
) marks the end of the second capture.
$ anchors the pattern to the end of the string.
So it matches and captures an optional ^ at the beginning of the string, followed by *, then an optional \ which is not captured, and captures the entire rest of the string. string.match would return two strings on success (either or both of which might be zero length), or nil on failure.
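For readers more used to conventional regexes, the Lua reading described above can be approximated in Python. The translated pattern below is a hand-written approximation, not something from the question, and it assumes the backslash survived string parsing (the long-string case). The names LUA_READING and lua_like_match are made up for the sketch.

```python
import re

# Hand translation of the Lua pattern ^(^?)*\?(.*)$ into a Python regex:
#   (\^?)  optional literal caret, captured
#   \*     literal asterisk
#   \\?    optional literal backslash, not captured
#   (.*)   rest of the string, captured
# re.DOTALL makes . match newlines too, as Lua's . does.
LUA_READING = re.compile(r"^(\^?)\*\\?(.*)$", re.DOTALL)

def lua_like_match(text):
    """Return the two captures, or None if the string does not match."""
    m = LUA_READING.match(text)
    return m.groups() if m else None

print(lua_like_match("^*\\rest"))  # ('^', 'rest')
print(lua_like_match("*rest"))     # ('', 'rest')
print(lua_like_match("rest"))      # None
```

This only mirrors the behaviour described in the breakdown; it is not a substitute for running the pattern through string.match in Lua itself.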
Edit: I've fixed some typos, and corrected an error in my answer, noticed by Egor in a comment. I forgot that in patterns, special symbols lose their specialness when in a spot where it can't apply. That makes the first asterisk match a literal asterisk rather than be an error. The cascade of that falls through most of the answer.
Note that if you really want a true regexp in Lua, there are libraries available that will provide it. That said, the built-in pattern language is quite powerful. If it is not sufficient, then you might be best off adopting a full parser, and use LPeg which can do everything a regexp can and more. It even comes with a module that provides a complete regexp syntax that is translated into an LPeg grammar for execution.
In this case, the (^?) refers to the previous string "^" meaning the literal character ^ as Jared has said. Check out regexlib for any further deciphering.
For all your Regex needs: http://regexlib.com/CheatSheet.aspx
It looks to me like the intent of the creator of the expression was to match any number of ^ before the question mark, but only wanted to capture the first instance of ^. However, it may not be a valid expression depending on the engine, as others have stated.

Regular expression for parsing string inside ""

<A "SystemTemperatureOutOfSpec" >
What should be the regular expression for parsing the string inside "". In the above sample it is 'SystemTemperatureOutOfSpec'
In JavaScript, this regexp:
/"([^"]*)"/
ex.
> /"([^"]*)"/.exec('<A "SystemTemperatureOutOfSpec" >')[1]
"SystemTemperatureOutOfSpec"
Similar patterns should work in a bunch of other programming languages.
try this
string Exp = "\"!\"";
I am not sure I understand your question well but if you need to match everything between double quotes, here it is: /(?<=").*?(?=")/s
(?<=<A\s")(?<content>.*)(?="\s>)
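The two lookaround patterns above can be tried as follows. This is a sketch in Python's re module, which is an assumption on my part: the /s flag is written as re.DOTALL there, and named groups are spelled (?P<name>...) instead of (?<name>...), so the second pattern is adjusted accordingly.

```python
import re

line = '<A "SystemTemperatureOutOfSpec" >'

# Lookbehind and lookahead assert the quotes without consuming them,
# so the whole match is already the text between the quotes.
between_quotes = re.search(r'(?<=").*?(?=")', line, re.DOTALL).group(0)

# Python spells named groups (?P<name>...) rather than (?<name>...).
named = re.search(r'(?<=<A\s")(?P<content>.*)(?="\s>)', line).group("content")

print(between_quotes)
print(named)
```

Both print SystemTemperatureOutOfSpec for the sample line.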
Regular expressions don't get much easier than this, so you should be able to solve it by yourself. Here's how you go about doing that:
The first step is to try to define as precisely as possible what you want to find. Let's start with this: you want to find a quote, followed by some number of characters other than a quote, followed by a quote. Is that correct? If so, our pattern has three parts: "a quote", "some characters other than a quote", and "a quote".
Now all we need to do is figure out what the regular expressions for those patterns are.
A quote
For "a quote", the pattern is literally ". Regular expressions have special characters which you have to be aware of (*, ., etc). Anything that's not a special character matches itself, and " is one of those characters. For a complete list of special characters for your language, see the documentation.
Characters other than a quote
So now the question is, how do we match "characters other than a quote"? That sounds like a range. A range is square brackets with a list of allowable characters. If the list begins with ^ it means it is a list of not-allowed characters. We want any characters other than a quote, so that means [^"].
"Some"
That range just means any one of the characters in the range, but we want "some". "Some" usually means either zero-or-more, or one-or-more. You can place * after a part of an expression to mean zero-or-more of that part. Likewise, use + to mean one-or-more (and ? means zero-or-one). There are a few other variations, but that's enough for this problem.
So, "some characters other than a quote" is the range [^"] (any character other than a quote) followed by * (zero-or-more). Thus, [^"]*
Putting it all together
This is the easy part: just combine all the pieces. A quote, followed by some characters other than a quote, followed by a quote, is "[^"]*".
Capturing the interesting part
The pattern we have will now match your string. What you want, however, is just the part inside the quotes. For that you need a "capturing group", which is denoted by parentheses. To capture a part of a regular expression, put it in parentheses. So, if we want to capture everything but the beginning and ending quote, the pattern becomes "([^"]*)".
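The steps so far can be assembled piece by piece in code. The sketch below uses Python's re module (an assumed language; the question does not say which one is in use), and the variable names are invented to mirror the headings above.

```python
import re

line = '<A "SystemTemperatureOutOfSpec" >'

quote = '"'               # a quote matches itself
not_quote = '[^"]'        # any single character other than a quote
some = not_quote + "*"    # zero or more of those characters

whole = quote + some + quote                  # "[^"]*"
capturing = quote + "(" + some + ")" + quote  # "([^"]*)"

matched = re.search(whole, line).group(0)      # includes the quotes
captured = re.search(capturing, line).group(1)  # only the inside

print(matched)
print(captured)
```

Group 0 is the whole match with its quotes, and group 1 is just the captured text between them.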
And that's how you learn regular expressions. Break your problem down into a precise statement composed of short sequences of characters, figure out the regular expression for each sequence, then put it all together.
The pattern in this answer may not actually be the perfect answer for you. There are some edge cases to worry about. For example, you may only want to match a quote following a non-word character, or only quotes at the beginning or end of a word. That's all possible, but is highly dependent on your exact problem. Figuring out how to do that is just as easy though -- decide what you want, then look at the documentation to see how to accomplish that.
Spend one day practicing on regular expressions and you'll never have to ask anyone for help with regular expressions for the rest of your career. They aren't hard, but they do require concentrated study.
Are you sure you need regular expression matching here? Looking at your "string", you might be better off using an XML parser.