Literal Characters in Regex Character Classes - regex

While looking through some regex material, I found that you can put literal characters inside a character class. I know that character classes support ranges as a shortcut for listing every letter or digit in a range, e.g. [1-47-9] matches every digit except 0, 5, and 6.
If you have a regex including literal characters in a character class, does it treat this the same way and match the range of those characters? For example, would [\000-\005] positively match \000, \001, \002, \003, \004, \005?

Yes, it does work this way. You can specify a range between any arbitrary characters and as long as the code point of the left side is less than the code point of the right side the range will match any character between them (inclusive).
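As a quick check of the answer above, here is a minimal sketch using Python's re module (an assumption, since the question names no engine). It lists every code point below 256 that the class from the question matches; the variable names are illustrative only.

```python
import re

# Octal escapes inside a character class denote single characters,
# so \000-\005 is an ordinary range over code points 0 through 5.
control_range = re.compile(r"[\000-\005]")

# Collect every code point below 256 whose character the class matches.
matched = [cp for cp in range(256) if control_range.fullmatch(chr(cp))]
print(matched)
```

Other engines may spell the escapes differently (for example \x00-\x05), but the range logic should be the same.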

Related

regular expression for all characters not in this range AND another character

I need to strip all characters that are not part of the ASCII standard, except for one other character.
To find all the non-ASCII characters I use this regex:
/[^\x01-\x7F]/
In order to exclude the character \x92 as well, how must I rewrite the regex?
Thanks.
Just stick it in there with the others:
/[^\x01-\x7F\x92]/
The elements of a character class can be individual characters or character ranges, and they can be freely mixed just by sticking them one after the other.
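A small sketch of that answer in Python (the language is an assumption; the question does not say which engine is in use). The helper name strip_disallowed and the sample string are made up for illustration.

```python
import re

# Negated class: matches anything that is NOT in \x01-\x7F and is NOT \x92.
not_allowed = re.compile(r"[^\x01-\x7F\x92]")

def strip_disallowed(text: str) -> str:
    """Remove every character matched by the negated class above."""
    return not_allowed.sub("", text)

sample = "caf\xe9 it\x92s"
print(repr(strip_disallowed(sample)))
```

Note that the range starts at \x01, so a NUL character (\x00) is stripped as well. In Python the pattern compares code points, so \x92 here means U+0092; an engine working on raw bytes in a particular encoding may treat it differently.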

What characters would be included in regex range a-Z? [duplicate]

If I have a regex that is [0-Z] or [a-Z] - what characters would it match? Is it valid regex? Can you have ranges in regex outside of 0-9, a-z and A-Z?
Yes, you can have other ranges. From MSDN - Character Classes in Regular Expressions:
The syntax for specifying a range of characters is as follows:
[firstCharacter-lastCharacter]
where firstCharacter is the character that begins the range and lastCharacter is the character that ends the range. A character range is a contiguous series of characters defined by specifying the first character in the series, a hyphen (-), and then the last character in the series. Two characters are contiguous if they have adjacent Unicode code points.
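So, in the end, [0-Z] will match 0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ. You can check the ASCII table for 0-Z.
As for [a-Z], its endpoints are in the wrong order (a has a higher code point than Z), so it does not specify a contiguous series and should match nothing; some engines reject it as an error instead.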
Just keep in mind, for the general rule, the effect can be wide: Unicode character codes, not just ASCII - ultimately, of course, it depends on the implementation, so, if in doubt, check it.
The range [0-Z] is valid. Depending on the regex engine, [a-Z] will either be invalid or be a range that can't match any characters. In a character class range, the start and end characters are just code points, and all characters between those code points are included in the range.
In the case of [0-Z], this is equivalent to the following more readable character class:
[0-9:;<=>?@A-Z]
In the case of [a-Z], this is actually a character class that won't match anything because a has a higher code point than Z.
You can see the code points in the ASCII table at http://www.asciitable.com/.
Ranges depend on the characters' (Unicode) values. A range from [0-9] makes sense, but a range from [9-0] does not. Likewise, a range from [a-Z] will be empty because 'a' is greater than 'Z'. (All the uppercase letters come first, and there are intervening characters between 'Z' and 'a'.) Rely on a table of character values (pull up charmap on Windows), and don't get fancy.
You can create any range as long as the characters' Unicode values run from lower to higher. Take ASCII for example: a has a higher value than Z, so the range a-Z is invalid. The range A-z is valid, but note that it includes non-letter characters like ^ and [. 0-Z is also valid and includes :, ?, and a whole bunch of other characters you probably don't want.
To answer your question, you can create any range in the right order. It may not be useful to use something like A-z, but something like a-d is pretty common.
Regex engines may react differently to ranges that are out of order or otherwise invalid.
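To see how one particular engine behaves, here is a minimal sketch using Python's re module (chosen only as an example; other engines may differ). The helper ascii_matches is hypothetical and simply lists the ASCII characters a class accepts.

```python
import re

def ascii_matches(char_class: str) -> str:
    """Return every ASCII character that the given character class matches."""
    pattern = re.compile(char_class)
    return "".join(chr(cp) for cp in range(128) if pattern.fullmatch(chr(cp)))

zero_to_upper_z = ascii_matches("[0-Z]")
upper_a_to_lower_z = ascii_matches("[A-z]")
print(zero_to_upper_z)
print(upper_a_to_lower_z)

# Python's re module rejects a reversed range instead of treating it as empty.
try:
    re.compile("[a-Z]")
    reversed_range_error = None
except re.error as exc:
    reversed_range_error = exc
print(reversed_range_error)
```

In this engine [a-Z] is a compile-time error rather than an empty class, which is one of the two behaviours described above.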

How to include special chars in this regex

First of all I am a total noob to regular expressions, so this may be optimized further, and if so, please tell me what to do. Anyway, after reading several articles about regex, I wrote a little regex for my password matching needs:
(?=.*[A-Z])(?=.*[a-z])(?=.*[0-9])(^[A-Z]+[a-z0-9]).{8,20}
What I am trying to do is: it must start with an uppercase letter, must contain a lowercase letter, must contain at least one number, must contain at least one special character, and must be between 8 and 20 characters in length.
The above somewhat works, but it doesn't force special chars (. seems to match any character, but I don't know how to use it with the positive lookahead) and the minimum length seems to be 10 instead of 8. What am I doing wrong?
PS: I am using http://gskinner.com/RegExr/ to test this.
Let's strip away the assertions and just look at your base pattern alone:
(^[A-Z]+[a-z0-9]).{8,20}
This will match one or more uppercase Latin letters, followed by a single lowercase Latin letter or decimal digit, followed by 8 to 20 of any character. So yes, at minimum this will require 10 characters, but there's no maximum number of characters it will match (e.g. it will allow 100 uppercase letters at the start of the string). Furthermore, since there's no end anchor ($), this pattern would allow any trailing characters after the matched substring.
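The three observations above can be checked with a short sketch. Python's re module is used here as an assumption (the question was tested in RegExr), and the sample strings are invented for illustration.

```python
import re

# The base pattern from the question, without the lookahead assertions.
base = re.compile(r"(^[A-Z]+[a-z0-9]).{8,20}")

too_short = "Aa1234567"                         # 9 characters
shortest = "Aa12345678"                         # 10 characters
many_uppercase = "A" * 100 + "a" + "12345678"   # far more than 20 characters
with_trailing = shortest + "x" * 50             # extra text after a valid prefix

for candidate in (too_short, shortest, many_uppercase, with_trailing):
    print(len(candidate), bool(base.search(candidate)))
```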
I'd recommend a pattern like this:
^(?=.*[a-z])(?=.*[0-9])(?=.*[!@#$])[A-Z]+[A-Za-z0-9!@#$]{7,19}$
Where !@#$ is a placeholder for whatever special characters you want to allow. Don't forget to escape special characters if necessary (\, ], ^ at the beginning of the character class, and - in the middle).
Using POSIX character classes, it might look like this:
^(?=.*[[:lower:]])(?=.*[[:digit:]])(?=.*[[:punct:]])[[:upper:]]+[[:alnum:][:punct:]]{7,19}$
Or using Unicode character classes, it might look like this:
^(?=.*[\p{Ll}])(?=.*\d)(?=.*[\p{P}\p{S}])[\p{Lu}]+[\p{L}\d\p{P}\p{S}]{7,19}$
Note: each of these considers a different set of 'special characters', so they aren't identical to the first pattern.
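On the escaping advice in this answer: one way to avoid escaping by hand is to let the language do it. The sketch below assumes Python and uses re.escape, which is not mentioned in the answer; the specials string is a hypothetical set chosen to include the characters that are special inside a class.

```python
import re

# Hypothetical set of allowed special characters, including the ones that
# have a meaning inside a character class: backslash, ], ^ and -
specials = "\\]^-!@#$"

# re.escape backslash-escapes regex metacharacters, so the class stays well-formed
# regardless of where ^ or - ends up inside the brackets.
special_class = "[" + re.escape(specials) + "]"
has_special = re.compile(special_class)

print(special_class)
print(bool(has_special.search("abc-def")), bool(has_special.search("abcdef")))
```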
The following should work:
^(?=.*[a-z])(?=.*[0-9])(?=.*[^a-zA-Z0-9])[A-Z].{7,19}$
I removed the (?=.*[A-Z]) because the requirement that the password start with an uppercase character already covers it. I added (?=.*[^a-zA-Z0-9]) for the special characters; this will only match if there is at least one character that is not a letter or a digit. I also tweaked the length check: the first step was to remove the + after the [A-Z] so that we know exactly one character has been matched so far, and the second was to change .{8,20} to .{7,19} (we can only match between 7 and 19 more characters if we have already matched 1).
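A minimal sketch that exercises this pattern, assuming Python's re module. The sample passwords are made up, and the expected values follow directly from the rules described above.

```python
import re

password_re = re.compile(
    r"^(?=.*[a-z])(?=.*[0-9])(?=.*[^a-zA-Z0-9])[A-Z].{7,19}$"
)

# Sample input mapped to the result we expect from the stated rules.
samples = {
    "Passw0rd!": True,
    "passw0rd!": False,               # does not start with an uppercase letter
    "Passw0rd": False,                # no special character
    "Password!": False,               # no digit
    "Pa1!": False,                    # too short
    "P" + "a1!" + "x" * 16: True,     # exactly 20 characters
    "P" + "a1!" + "x" * 17: False,    # 21 characters, too long
}

results = {text: bool(password_re.match(text)) for text in samples}
for text, ok in results.items():
    print(f"{text!r}: {ok}")
```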
Well, here is how I would write it, if I had such requirements - excepting situations where it's absolutely not possible or practical, I prefer to break up complex regular expressions. Note that this is English-specific, so a Unicode or POSIX character class (where supported) may make more sense:
/^[A-Z]/ && /[a-z]/ && /[0-9]/ && /[whatever special]/ && ofCorrectLength(x)
That is, I would avoid trying to incorporate all the rules at once.
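The same idea as a runnable sketch, assuming Python. The function name is_valid_password, the default specials set, and the 8 to 20 length bounds (taken from the question) are illustrative choices, not part of the answer.

```python
import re

def is_valid_password(candidate: str, specials: str = "!@#$") -> bool:
    """Check each rule separately instead of in one combined regex."""
    # specials is a hypothetical placeholder set; re.escape keeps the class well-formed.
    special_class = "[" + re.escape(specials) + "]"
    return (
        8 <= len(candidate) <= 20                                 # length rule
        and re.match(r"[A-Z]", candidate) is not None             # starts with uppercase
        and re.search(r"[a-z]", candidate) is not None            # has a lowercase letter
        and re.search(r"[0-9]", candidate) is not None            # has a digit (0 included)
        and re.search(special_class, candidate) is not None       # has a special character
    )

print(is_valid_password("Passw0rd!"))
print(is_valid_password("Passw0rd"))
```

Keeping the rules separate also makes it easy to report which rule failed, which a single combined pattern cannot do.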

Regex for minimum number of characters

I created this regular expression to validate names:
^[a-zA-Z0-9\s\-\,]+.\*?$
Is there a way to add a minimum number of characters?
I know we can use {x,}, but I cannot make it work.
{x,} should be used instead of + here...
^[a-zA-Z0-9\s,-]{5,}
But this would mean "at least 5 characters at the beginning match those from the character class, and then anything".
If you write it like this (almost your original - just with {5,} instead of +):
^[a-zA-Z0-9\s\-\,]{5,}.\*?$
This means "at least 5 characters in the beginning match those from the character class, and any one character, and then optionally an asterisk, and that should be the end of it".
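A short sketch contrasting the two patterns from this answer, assuming Python's re module; the variable names and sample strings are illustrative only.

```python
import re

# First variant: only the first 5+ characters are constrained.
prefix_only = re.compile(r"^[a-zA-Z0-9\s,-]{5,}")
# Second variant: 5+ class characters, any one character, optional *, then end.
anchored = re.compile(r"^[a-zA-Z0-9\s\-\,]{5,}.\*?$")

print(bool(prefix_only.search("abcde!!!###")))  # anything may follow the prefix
print(bool(anchored.search("abcde")))           # the extra . needs a sixth character
print(bool(anchored.search("abcde!*")))
```

Because of the trailing . the anchored variant effectively requires at least six characters, not five.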
Use a lookahead at the beginning of the regex to make sure the total number of characters is at least your minimum. For example, if your minimum is 8 characters:
^(?=.{8,})[a-zA-Z0-9\s\-,]+.\*?$
Also, you don't need to escape the comma.
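The lookahead approach can be tried out with a minimal sketch like this one (Python is an assumption; the names tested are made up). The lookahead only checks the total length and consumes nothing, so the rest of the pattern still validates the content.

```python
import re

# (?=.{8,}) asserts that at least 8 characters follow the start of the string.
name_re = re.compile(r"^(?=.{8,})[a-zA-Z0-9\s\-,]+.\*?$")

for candidate in ("John Smith", "John Smi", "John Sm", "Al", "Jo!!!!!!!!"):
    print(repr(candidate), bool(name_re.search(candidate)))
```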

Regular expression check for special character or number

I have this regular expression, which checks for at least one special character in the string:
^(.*[^0-9a-zA-Z].*)$
But how could I change it to check for at least one special character or at least one number in the string?
.*[^a-zA-Z]+.*
would match anything, followed by one or more characters that are not letters (that is, special characters or digits), followed by anything.
Notice that I just removed the 0-9 from the character class (characters included in the square brackets).
Also, I removed the ^ and $ markers -- those match the beginning and end of string respectively. You don't need it because you're making it redundant with the .* (match zero or more of any character) anyway.
In fact, if you're just checking if the string contains a special character, then the following is good enough:
[^a-zA-Z]
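A minimal sketch of that check in Python (an assumption, since the question names no language). The helper has_special_or_digit is hypothetical; the second pattern is included only to show that wrapping the class in .* does not change the result of a search.

```python
import re

non_letter = re.compile(r"[^a-zA-Z]")
wrapped = re.compile(r".*[^a-zA-Z]+.*")

def has_special_or_digit(text: str) -> bool:
    """True if the text contains at least one character that is not an ASCII letter."""
    return non_letter.search(text) is not None

samples = ["abc1", "abc!", "abc", ""]
for text in samples:
    print(repr(text), has_special_or_digit(text), wrapped.search(text) is not None)
```

Note that whitespace and non-ASCII letters also count as "not a letter" for this class.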
You can use Expresso, a tool for building and testing regular expressions.