Regular expression explanation required - regex

I came across this regular expression which is used to check for alphabetic strings. Can anyone explain how it works to me?
/^\pL++$/uD
Thanks.

\pL+ (sometimes written as \p{L}) matches one or more Unicode letter(s). I prefer \p{L} to \pL because there are other Unicode properties like \p{Lu} (uppercase letter) that only work with the braces; \pLu would mean "a Unicode letter followed by the letter u").
The additional + makes the quantifier possessive, meaning that it will never relinquish any characters it has matched, even if that means an overall match will fail. In the example regex, this is unnecessary and can be omitted.
^ and $ anchor the match at the start and end of the string, ensuring that the entire string has to consist of letters. Without them, the regex would also match a substring surrounded by non-letters.
The entire regex is delimited by slashes (/). After the trailing slash, PHP regex options follow. u is the Unicode option (necessary to handle the Unicode property). D ensures that the $ only matches at the very end of the string (otherwise it would also match right before the final newline in a string if that string ends in a newline).

Looks like PCRE flavor.
According to RegexBuddy:
Assert position at the beginning of the string «^»
A character with the Unicode property “letter” (any kind of letter from any language) «\pL++»
Between one and unlimited times, as many times as possible, without giving back (possessive) «++»
Assert position at the end of the string (or before the line break at the end of the string, if any) «$»

This looks like Unicode processing.. I found a neat article here that seems to explain \pL the rest are anchors and repetition characters.. which are also explained on this site:
http://www.regular-expressions.info/unicode.html
Enjoy

Related

Regex to match other than listed string

I need to select a value which not listed in following string including all special characters.
List of string and requirement that need to rejected:
XNIL
SNIL
All special characters
My expression is like this (?!XNIL|SNIL|[\W])\w+
The problem is, if my text have a word XNIL or SNIL, it still allow the word NIL. But i have listed the word XNIL and SNIL to be rejected. Any mistake did i made here?
You can check my regex online here -> http://regexr.com/3cdsl
This seems to work on your test page: (?!(XNIL|SNIL|\W+))\b\w+ At least it solves the XNIL/SNIL problem.
The reason why your regex was matching XNIL was it was matching from the \w+. To see why, take your original and change \w+ to \w and notice the difference.
UPDATE:
Based on your feedback, you also wish to exclude _.
Because _ is used in programming language symbols, and [arguably] regexes were created, of, by, and for programmers, _ is considered a "word" char (i.e. it's in \w and therefore not excluded by \W).
From the [perl] regex man page:
\w Match a "word" character (alphanumeric plus "_", plus other connector punctuation chars plus Unicode marks)
Your final regex might need to be: (?!(XNIL|SNIL|_+|\W+))\b\w+. (Note: the _+)
A cleaner way: (?!(XNIL|SNIL|[\W_]+))\b\w+ which produces the same results yet is closer in intent to what you wanted.
You may have to adjust \w+ accordingly as well
If you really want to be sure, at the expense of being slightly more verbose, write out the character class as you choose:
(?!(XNIL|SNIL|[^a-zA-Z0-9]+))\b[a-zA-Z0-9]+
Check this regex
[^(XNIL|SNIL|[^\w])]
Explanation
[] having ^ at beginning says the that any thing that is not there in the list given in [] should be matched.
(XNIL|SNIL|[^\w+]) matches words XNIL or SNIL or [^\w] matches anything other than words(i.e. special chars)
So the whole regex matches any thing that is not there in [^(XNIL|SNIL|[^\w])]
This should work
(?m)^(((?!XNIL|SNIL|[\W]).)*)$
Grouping the character match with the negative lookahead will cause the zero length assertion to continue until finished (in this case at the end of the string due to $)

regex to catch dollar amount catching # symbol too

I'm trying to match simple dollar strings ($34.21). My regex is as follows.
\$\d+.([0-9][0-9])
I'm catching the following string though and I don't know why.
#$23.23
Does the at symbol have some kind of special meaning? I don't see it on my regex cheat sheet and this is bugging me.
Thanks, mj
You should escape the \. and you probably need to add start (^) and end ($) anchors around the pattern:
^\$\d+\.([0-9][0-9])$
The anchors are used to ensure that no other characters are allowed in the input string before or after the matched string.
Also, depending on the exact language / platform your using*, this can probably be further simplified to:
^\$\d+\.(\d\d)$
* Some regex engines treat \d as equivalent to [0-9], while on others it will match any Unicode digit, including those from other numeral systems.
Use line start and line end anchors to make sure you don't match unwanted input:
^\$\d+\.([0-9][0-9])$
OR
^\$\d+\.\d{1,2}$

Regex to match anything

I know it seems a bit redundant but I'd like a regex to match anything.
At the moment we are using ^*$ but it doesn't seem to match no matter what the text.
I do a manual check for no text but the test view we use is always validated with a regex. However, sometimes we need it to validate anything using a regex. i.e. it doesn't matter what is in the text field, it can be anything.
I don't actually produce the regex and I'm a complete beginner with them.
The regex .* will match anything (including the empty string, as Junuxx points out).
The chosen answer is slightly incorrect, as it wont match line breaks or returns. This regex to match anything is useful if your desired selection includes any line breaks:
[\s\S]+
[\s\S] matches a character that is either a whitespace character (including line break characters), or a character that is not a whitespace character. Since all characters are either whitespace or non-whitespace, this character class matches any character. the + matches one or more of the preceding expression
^ is the beginning-of-line anchor, so it will be a "zero-width match," meaning it won't match any actual characters (and the first character matched after the ^ will be the first character of the string). Similarly, $ is the end-of-line anchor.
* is a quantifier. It will not by itself match anything; it only indicates how many times a portion of the pattern can be matched. Specifically, it indicates that the previous "atom" (that is, the previous character or the previous parenthesized sub-pattern) can match any number of times.
To actually match some set of characters, you need to use a character class. As RichieHindle pointed out, the character class you need here is ., which represents any character except newlines (and it can be made to match newlines as well using the appropriate flag). So .* represents * (any number) matches on . (any character). Similarly, .+ represents + (at least one) matches on . (any character).
I know this is a bit old post, but we can have different ways like :
.*
(.*?)

Regular Expression to test an entire word

i have this expression ([a-zA-Z]|ñ|Ñ)* which i want to use to block all characters but letters and Ñ to be entered on a textbox.
The problem is that return a match for: A9023 but also for 32""". How can i do to return a match for A9023 but not for 32""".
Thanks.
You need to add assertions for the start and the end of the string:
^([a-zA-Z]|ñ|Ñ)*$
Otherwise the regular expression matches at any position. Additionally, you can also write ([a-zA-Z]|ñ|Ñ)* as the character class [a-zA-ZñÑ]*:
^[a-zA-ZñÑ]*$
Sure that you don't mean ^([a-zA-Z]|ñ|Ñ)*$ -- you might be finding the characters you want but not excluding what you don't? The expression I mentioned will pin to the beginning ^ and the end $ of the string, so that nothing else will pass. Otherwise:
123ABC456
...will pass your match, because it found 0-or-more letters... though there were also other letters.
You didn't say which regex flavor (which programming language) you're using, but you might want to consider either
^\p{L}*$
if your regex flavor supports Unicode properties or
^[^\W\d_]*$
if it doesn't.
Reason: Your regex will allow only unaccented letters and Ñ - is there a real language that uses the latter without also having accented letters?
\p{L} means "any letter in any 'language'",
[^\W\d_] means "any character that is neither a non-alphanumeric, a digit or an underscore", which is just a fancy but necessary way to say "any letter" (\w is a shorthand for "letter, digit or underscore", \W is the inverse of that).

Regex for alphanumeric, but at least one letter

In my ASP.NET page, I have an input box that has to have the following validation on it:
Must be alphanumeric, with at least one letter (i.e. can't be ALL
numbers).
^\d*[a-zA-Z][a-zA-Z0-9]*$
Basically this means:
Zero or more ASCII digits;
One alphabetic ASCII character;
Zero or more alphanumeric ASCII characters.
Try a few tests and you'll see this'll pass any alphanumeric ASCII string where at least one non-numeric ASCII character is required.
The key to this is the \d* at the front. Without it the regex gets much more awkward to do.
Most answers to this question are correct, but there's an alternative, that (in some cases) offers more flexibility if you want to change the rules later on:
^(?=.*[a-zA-Z].*)([a-zA-Z0-9]+)$
This will match any sequence of alphanumerical characters, but only if the first group also matches the whole sequence. It's a little-known trick in regular expressions that allows you to handle some very difficult validation problems.
For example, say you need to add another constraint: the string should be between 6 and 12 characters long. The obvious solutions posted here wouldn't work, but using the look-ahead trick, the regex simply becomes:
^(?=.*[a-zA-Z].*)([a-zA-Z0-9]{6,12})$
^[\p{L}\p{N}]*\p{L}[\p{L}\p{N}]*$
Explanation:
[\p{L}\p{N}]* matches zero or more Unicode letters or numbers
\p{L} matches one letter
[\p{L}\p{N}]* matches zero or more Unicode letters or numbers
^ and $ anchor the string, ensuring the regex matches the entire string. You may be able to omit these, depending on which regex matching function you call.
Result: you can have any alphanumeric string except there's got to be a letter in there somewhere.
\p{L} is similar to [A-Za-z] except it will include all letters from all alphabets, with or without accents and diacritical marks. It is much more inclusive, using a larger set of Unicode characters. If you don't want that flexibility substitute [A-Za-z]. A similar remark applies to \p{N} which could be replaced by [0-9] if you want to keep it simple. See the MSDN page on character classes for more information.
The less fancy non-Unicode version would be
^[A-Za-z0-9]*[A-Za-z][A-Za-z0-9]*$
^[0-9]*[A-Za-z][0-9A-Za-z]*$
is the regex that will do what you're after. The ^ and $ match the start and end of the word to prevent other characters. You could replace the [0-9A-z] block with \w, but i prefer to more verbose form because it's easier to extend with other characters if you want.
Add a regular expression validator to your asp.net page as per the tutorial on MSDN: http://msdn.microsoft.com/en-us/library/ms998267.aspx.
^\w*[\p{L}]\w*$
This one's not that hard. The regular expression reads: match a line starting with any number of word characters (letters, numbers, punctuation (which you might not want)), that contains one letter character (that's the [\p{L}] part in the middle), followed by any number of word characters again.
If you want to exclude punctuation, you'll need a heftier expression:
^[\p{L}\p{N}]*[\p{L}][\p{L}\p{N}]*$
And if you don't care about Unicode you can use a boring expression:
^[A-Za-z0-9]*[A-Za-z][A-Za-z0-9]*$
^[0-9]*[a-zA-Z][a-zA-Z0-9]*$
Can be
any number ended with a character,
or an alphanumeric expression started with a character
or an alphanumeric expression started with a number, followed by a character and ended with an alphanumeric subexpression