How do dashes work in regex? - regex

I'm curious about how a regex decides which characters to include when a - is used inside a character class.
Example: [a-zA-Z0-9]
This matches any letter a through z in either case, and the digits 0 through 9.
I had originally thought that ranges worked like macros, so that a-z expands to a, b, c, d, e, and so on. But then I saw the following in an open source project:
text.tr('A-Za-z1-90', 'Ⓐ-Ⓩⓐ-ⓩ①-⑨⓪')
That changed my whole view of regexes, because these are not your typical characters, and I wondered how this could possibly work correctly.
My theory is that the - literally means:
Any ASCII value between the left character and the right character (e.g. a-z is 97-122).
Could anybody confirm whether my theory is correct? Does the regex pattern in fact calculate the range from the character codes of any two characters?
Furthermore, if it IS correct, could you perform a regex match like
A-z
because A is 65 and z is 122, so theoretically it should also match all characters between those values?

From MSDN - Character Classes in Regular Expressions (bold is mine):
The syntax for specifying a range of characters is as follows:
[firstCharacter-lastCharacter]
where firstCharacter is the character that begins the range and lastCharacter is the character that ends the range. A character range is a contiguous series of characters defined by specifying the first character in the series, a hyphen (-), and then the last character in the series. Two characters are contiguous if they have adjacent Unicode code points.
So your assumption is correct, but the effect is, in fact, wider: Unicode character codes, not just ASCII.
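To make the quoted rule concrete, here is a small Python sketch (Python is only my choice for illustration, and chars_in_range is a hypothetical helper, not part of any library). It expands a range by code point and checks that the circled capitals from the question occupy a contiguous block, which would explain why a range over them behaves like A-Z:

```python
import re

def chars_in_range(first: str, last: str) -> str:
    """Hypothetical helper: every character from first to last by code point, inclusive."""
    return "".join(chr(cp) for cp in range(ord(first), ord(last) + 1))

# a-z covers code points 97 through 122.
print(ord("a"), ord("z"))

# The circled capitals run from U+24B6 (circled A) to U+24CF (circled Z)
# with no gaps, so a range over them contains exactly 26 characters.
circled_caps = chars_in_range("\u24b6", "\u24cf")
print(len(circled_caps))

# U+24C2 is circled M, which falls inside that range.
circled_m_matches = re.fullmatch("[\u24b6-\u24cf]", "\u24c2") is not None
print(circled_m_matches)
```

The same reasoning applies to any pair of endpoints: the engine only compares code points, it has no notion of "letters" or "digits" when it expands a range.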

Both of your assumptions are correct. Technically, you could even write [#-~] and it would still be valid, matching uppercase letters, lowercase letters, digits, and certain symbols.
You can also do this with Unicode, like [\u0000-\u1000].
You should not do [A-z], however, because there are some characters between the uppercase and lowercase letters (specifically [, \, ], ^, _, `).
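A quick way to see the problem with [A-z] is to list what it matches beyond the letters. This is a minimal sketch using Python's re module (an assumption on my part; other engines should agree for this ASCII range):

```python
import re

# Collect every character that [A-z] accepts but [A-Za-z] does not.
extras = [
    ch
    for ch in map(chr, range(ord("A"), ord("z") + 1))
    if re.fullmatch("[A-z]", ch) and not re.fullmatch("[A-Za-z]", ch)
]

# These are the six characters with code points 91 through 96.
print(extras)
```

So a pattern such as ^[A-z]+$ would happily accept a string like foo_bar or a^b, which is rarely what is intended.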

Related

RegEx: Non-repeating patterns?

I'm wrestling with how to write a specific regex, and thought I'd come here for a little guidance.
What I'm looking for is an expression that does the following:
Character length of 7 or more
Any single character is one of four patterns (uppercase letters, lowercase letters, numbers and a specific set of special characters. Let's say #$%#).
(Now, here's where I'm having problems):
Another single character would also match with one of the patterns described above EXCEPT for the pattern that was already matched. So, if the first pattern matched is an uppercase letter, the second character match should be a lowercase letter, number or special character from the pattern.
To give you an example, the string AAAAAA# would match, as would the string AAAAAAa. However, the string AAAAAAA would not match, nor would the string AAAAAA& (as the ampersand is not part of the special character pattern).
Any ideas? Thanks!
If you only need two different kinds of characters, you can use the possessive quantifier feature (available in Objective C):
^(?:[a-z]++|[A-Z]++|[0-9]++|[#$%#]++)[a-zA-Z0-9#$%#]+$
or more concise with an atomic group:
^(?>[a-z]+|[A-Z]+|[0-9]+|[#$%#]+)[a-zA-Z0-9#$%#]+$
Since each branch of the alternation is a character class with a possessive quantifier, you can be sure that the first character matched by [a-zA-Z0-9#$%#]+ is from a different class.
As for the string length, check it separately first with the appropriate function; if the string is too short, you avoid the cost of a regex check altogether.
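For engines without possessive quantifiers or atomic groups, the same idea can be approximated. The sketch below is in Python and swaps in a different technique: a lookahead with a capture group followed by a backreference, (?=(...))\1, which is a common way to emulate an atomic group. The function name has_two_kinds and the default minimum length are my own assumptions; the special-character set is copied from the question.

```python
import re

# (?=(...))\1 emulates the atomic group: the lookahead captures the whole run
# of characters from one class, and the backreference then consumes exactly
# that run, so the engine cannot give part of it back to the trailing class.
TWO_KINDS = re.compile(
    r"^(?=([a-z]+|[A-Z]+|[0-9]+|[#$%#]+))\1[a-zA-Z0-9#$%#]+$"
)

def has_two_kinds(s: str, min_length: int = 7) -> bool:
    """Hypothetical helper: length check first, then the regex check."""
    if len(s) < min_length:
        return False  # too short, skip the regex entirely
    return TWO_KINDS.match(s) is not None

for candidate in ("AAAAAA#", "AAAAAAa", "AAAAAAA", "AAAAAA&"):
    print(candidate, has_two_kinds(candidate))
```

The first two candidates from the question are accepted and the last two are rejected, matching the behaviour described for the atomic-group version.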
First you need to do a negative lookahead to make sure the entire string doesn't consist of characters from a single group:
(?!(?:[a-z]*|[A-Z]*|[0-9]*|[#$%#]*)$)
Then check that it does contain at least 7 characters from the list of legal characters (and nothing else):
^[a-zA-Z0-9#$%#]{7,}$
Combining them (thanks to Shlomo for pointing that out):
^(?!(?:[a-z]*|[A-Z]*|[0-9]*|[#$%#]*)$)[a-zA-Z0-9#$%#]{7,}$
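The combined pattern can be tried against the examples from the question. This is a minimal sketch in Python (my choice of language; the pattern itself is copied unchanged from above, and is_acceptable is a hypothetical name):

```python
import re

# Negative lookahead rejects strings drawn from a single group; the rest
# requires 7 or more characters from the allowed set and nothing else.
COMBINED = re.compile(
    r"^(?!(?:[a-z]*|[A-Z]*|[0-9]*|[#$%#]*)$)[a-zA-Z0-9#$%#]{7,}$"
)

def is_acceptable(s: str) -> bool:
    return COMBINED.match(s) is not None

for candidate in ("AAAAAA#", "AAAAAAa", "AAAAAAA", "AAAAAA&", "Aa1#"):
    print(candidate, is_acceptable(candidate))
```

Unlike the possessive version, this one needs no special engine features, only lookahead support.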

regular expression for all characters not in this range AND another character

I need to strip all characters that are not part of the ASCII standard EXCEPT FROM one other character.
In order to find all the non-ASCII character I use this regex:
/[^\x01-\x7F]/
In order to exclude the character \x92 as well, how must I rewrite the regex?
Thanks.
Just stick it in there with the others:
/[^\x01-\x7F\x92]/
The elements of a character class can be individual characters or character ranges, and they can be freely mixed just by sticking them one after the other.
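Here is how that class might be used to do the actual stripping. This is a sketch in Python, assuming the input is already a decoded string, so \x92 means the code point U+0092; strip_non_ascii is a hypothetical name. If your data is raw bytes in some legacy encoding, the details will differ.

```python
import re

# Everything outside \x01-\x7F is removed, except \x92 which is listed
# alongside the range inside the negated class.
NON_ASCII_EXCEPT_X92 = re.compile(r"[^\x01-\x7F\x92]")

def strip_non_ascii(text: str) -> str:
    return NON_ASCII_EXCEPT_X92.sub("", text)

sample = "caf\xe9 it\x92s"
print(repr(strip_non_ascii(sample)))
```

Note that because the range starts at \x01, a NUL character (\x00) is stripped as well.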

What characters would be included in regex range a-Z? [duplicate]

If I have a regex that is [0-Z] or [a-Z] - what characters would it match? Is it valid regex? Can you have ranges in regex outside of 0-9, a-z and A-Z?
Yes, you can have other ranges. From MSDN - Character Classes in Regular Expressions (bold is mine):
The syntax for specifying a range of characters is as follows:
[firstCharacter-lastCharacter]
where firstCharacter is the character that begins the range and lastCharacter is the character that ends the range. A character range is a contiguous series of characters defined by specifying the first character in the series, a hyphen (-), and then the last character in the series. Two characters are contiguous if they have adjacent Unicode code points.
So, in the end, [0-Z] will match 0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ. You can check an ASCII table for the characters between 0 and Z.
As for [a-Z], a (97) comes after Z (90), so the two do not define a contiguous series in ascending order; depending on the engine, the range is either rejected as invalid or matches nothing.
Just keep in mind that, as a general rule, the effect can be wide: ranges use Unicode code points, not just ASCII. Ultimately, of course, it depends on the implementation, so if in doubt, check it.
The range [0-Z] is valid. Depending on the regex engine, [a-Z] will either be invalid or be a range that can't match any characters. In a character class range, the start and end characters are just code points, and all characters between those code points (inclusive) are included in the range.
In the case of [0-Z], this is equivalent to the following more readable character class:
[0-9:;<=>?@A-Z]
In the case of [a-Z], this is actually a character class that won't match anything because a has a higher code point than Z.
You can look up the code points in an ASCII table such as the one at http://www.asciitable.com/.
Ranges depend on the character's (Unicode) value. A range from [0-9] makes sense, but a range from [9-0] does not. Likewise, a range from [a-Z] will be empty because 'a' is greater than 'Z'. (All the uppercase letters come first, and there are intervening characters between 'Z' and 'a'.) Rely on a table of character values (pull up charmap on Windows), and don't get fancy.
You can create any range as long as the characters' Unicode values go from lower to higher. Take ASCII for example. a is higher in order than Z, so the range a-Z is invalid. The range A-z is valid, but you should note that it includes non-letter characters like ^ and [. 0-Z is also valid and includes :, ?, and a whole bunch of other characters you probably don't want.
To answer your question, you can create any range in the right order. It may not be useful to use something like A-z, but something like a-d is pretty common.
Regex engines may react differently to ranges that are out of order or otherwise invalid.
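As one data point on how an engine reacts, here is a short sketch using Python's re module (my choice for illustration; matched_ascii is a hypothetical helper). It lists what [0-Z] actually matches and shows that this particular engine refuses to compile the reversed range [a-Z] rather than treating it as empty:

```python
import re

def matched_ascii(char_class: str) -> str:
    """Hypothetical helper: the ASCII characters that a character class matches."""
    pattern = re.compile(char_class)
    return "".join(chr(cp) for cp in range(128) if pattern.fullmatch(chr(cp)))

# Digits, the seven symbols from : to @, then the uppercase letters.
print(matched_ascii("[0-Z]"))

# Python raises re.error for a range whose start is above its end.
try:
    re.compile("[a-Z]")
    reversed_range_compiles = True
except re.error as exc:
    reversed_range_compiles = False
    print("rejected:", exc)
```

Other engines may behave differently for the reversed range, so it is worth running the equivalent check in whatever engine you target.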

How to include special chars in this regex

First of all I am a total noob to regular expressions, so this may be optimized further, and if so, please tell me what to do. Anyway, after reading several articles about regex, I wrote a little regex for my password matching needs:
(?=.*[A-Z])(?=.*[a-z])(?=.*[0-9])(^[A-Z]+[a-z0-9]).{8,20}
What I am trying to do is: it must start with an uppercase letter, must contain a lowercase letter, must contain at least one number, must contain at least one special character, and must be between 8 and 20 characters in length.
The above somehow works, but it doesn't force special chars (. seems to match any character, but I don't know how to use it with the positive lookahead) and the min length seems to be 10 instead of 8. What am I doing wrong?
PS: I am using http://gskinner.com/RegExr/ to test this.
Let's strip away the assertions and just look at your base pattern alone:
(^[A-Z]+[a-z0-9]).{8,20}
This will match one or more uppercase Latin letters, followed by a single lowercase Latin letter or decimal digit, followed by 8 to 20 of any character. So yes, at minimum this will require 10 characters, but there's no maximum number of characters it will match (e.g. it will allow 100 uppercase letters at the start of the string). Furthermore, since there's no end anchor ($), this pattern would allow any trailing characters after the matched substring.
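These three observations are easy to confirm. A minimal sketch in Python (my choice of language; base_matches is a hypothetical helper, and search is used because the pattern carries its own ^ anchor):

```python
import re

BASE = re.compile(r"(^[A-Z]+[a-z0-9]).{8,20}")

def base_matches(s: str) -> bool:
    return BASE.search(s) is not None

print(base_matches("Ab1234567"))               # 9 characters in total
print(base_matches("Ab12345678"))              # 10 characters in total
print(base_matches("A" * 100 + "b12345678"))   # 100 uppercase letters up front
print(base_matches("Ab" + "x" * 50))           # far more than 20 trailing characters
```

Only the 9-character string is rejected; the others are accepted, including the two that are clearly longer than 20 characters.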
I'd recommend a pattern like this:
^(?=.*[a-z])(?=.*[0-9])(?=.*[!##$])[A-Z][A-Za-z0-9!##$]{7,19}$
Where !##$ is a placeholder for whatever special characters you want to allow. Don't forget to escape special characters if necessary (\, ], ^ at the beginning of the character class, and - in the middle). The leading [A-Z] matches exactly one character, so the total length stays between 8 and 20.
Using POSIX character classes, it might look like this:
^(?=.*[[:lower:]])(?=.*[[:digit:]])(?=.*[[:punct:]])[[:upper:]][[:alnum:][:punct:]]{7,19}$
Or using Unicode character classes, it might look like this:
^(?=.*[\p{Ll}])(?=.*\d)(?=.*[\p{P}\p{S}])[\p{Lu}][\p{L}\d\p{P}\p{S}]{7,19}$
Note: each of these considers a different set of 'special characters', so they aren't identical to the first pattern.
The following should work:
^(?=.*[a-z])(?=.*[0-9])(?=.*[^a-zA-Z0-9])[A-Z].{7,19}$
I removed the (?=.*[A-Z]) because the requirement that the password start with an uppercase character already covers it. I added (?=.*[^a-zA-Z0-9]) for the special characters; it only matches if there is at least one character that is not a letter or a digit. I also tweaked the length check a little: the first step was to remove the + after the [A-Z] so that we know exactly one character has been matched so far, and the second was to change .{8,20} to .{7,19} (we can only match between 7 and 19 more characters if we have already matched 1).
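To check the pattern against a few inputs, here is a minimal sketch in Python (my choice for illustration; the sample passwords and the name password_ok are made up):

```python
import re

PASSWORD = re.compile(r"^(?=.*[a-z])(?=.*[0-9])(?=.*[^a-zA-Z0-9])[A-Z].{7,19}$")

def password_ok(pw: str) -> bool:
    return PASSWORD.match(pw) is not None

samples = {
    "Passw0rd!": "meets every rule",
    "Passw0rd1": "no special character",
    "passw0rd!": "does not start with an uppercase letter",
    "Pa1!": "too short",
    "Pa1!" + "x" * 16: "exactly 20 characters",
    "Pa1!" + "x" * 17: "21 characters, too long",
}
for pw, note in samples.items():
    print(pw, password_ok(pw), note)
```

Keep in mind that [^a-zA-Z0-9] treats anything that is not an ASCII letter or digit as "special", including spaces, so this is broader than an explicit list of allowed symbols.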
Well, here is how I would write it, if I had such requirements - excepting situations where it's absolutely not possible or practical, I prefer to break up complex regular expressions. Note that this is English-specific, so a Unicode or POSIX character class (where supported) may make more sense:
/^[A-Z]/ && /[a-z]/ && /[0-9]/ && /[whatever special]/ && ofCorrectLength(x)
That is, I would avoid trying to incorporate all the rules at once.
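Spelled out, the broken-up version could look something like the following. This is only a sketch in Python under my own assumptions: the name is_valid_password, the placeholder special-character set, and the 8 to 20 length bounds taken from the question.

```python
import re

ALLOWED_SPECIALS = "!#$"  # placeholder set (an assumption); substitute your own

def is_valid_password(pw: str) -> bool:
    """Apply each rule as its own small check instead of one large regex."""
    checks = (
        re.match(r"[A-Z]", pw),    # starts with an uppercase letter
        re.search(r"[a-z]", pw),   # contains a lowercase letter
        re.search(r"[0-9]", pw),   # contains a digit
        # re.escape keeps characters such as ] or ^ from breaking the class
        re.search("[" + re.escape(ALLOWED_SPECIALS) + "]", pw),
    )
    return all(checks) and 8 <= len(pw) <= 20

print(is_valid_password("Passw0rd!"))
print(is_valid_password("Passw0rd"))
```

Each rule can then be reported, changed, or unit-tested on its own, which is the main attraction of this style. Note that, like the one-liner above, it does not restrict which other characters may appear.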

Literal Characters in Regex Character Classes

While looking through some regex stuff, I found that you can put literal characters inside a character class. I know that in character classes you can use ranges as a shortcut instead of specifying every letter or number in a range, e.g. [1-47-9] matches every digit except 0, 5, and 6.
If you have a regex including literal characters in a character class, does it treat this the same way and match the range of those characters? For example, would [\000-\005] positively match \000, \001, \002, \003, \004, \005?
Yes, it does work this way. You can specify a range between any two characters, and as long as the code point of the left side is not greater than the code point of the right side, the range will match any character between them (inclusive).
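Both examples from the question can be checked directly. A minimal sketch with Python's re module (my choice of engine; it accepts three-digit octal escapes such as \000 inside a character class, and other engines may spell such escapes differently):

```python
import re

# Which of the first ten code points fall inside [\000-\005]?
control_range = re.compile(r"[\000-\005]")
in_range = [cp for cp in range(10) if control_range.fullmatch(chr(cp))]
print(in_range)

# [1-47-9] is two ranges side by side: 1-4 and 7-9.
digits = [d for d in "0123456789" if re.fullmatch("[1-47-9]", d)]
print(digits)
```

The first list contains the code points 0 through 5 and nothing else; the second contains every digit except 0, 5, and 6.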