Regular Expression for validating username - regex

I'm looking for a regular expression to validate a username.
The username may contain:
Letters (western, greek, russian etc.)
Numbers
Spaces, but only 1 at a time
Special characters (for example: "!@#$%^&*.:;<>?/\|{}[]_+=-"), but only 1 at a time
EDIT:
Sorry for the confusion.
I need it for cocoa-touch, but I'll have to translate it to PHP for the server side anyway.
And by "1 at a time" I mean that spaces and special characters should be separated by letters or numbers.

Instead of writing one big regular expression, it would be clearer to write separate regular expressions to test each of your desired conditions.
Test whether the username contains only letters, numbers, ASCII symbols ! through #, and space: ^(\p{L}|\p{N}|[!-#]| )+$. This must match for the username to be valid. Note the use of the \p{L} class for Unicode letters and the \p{N} class for Unicode numbers.
Test whether the username contains consecutive spaces: \s\s+. If this matches, the username is invalid.
Test whether symbols occur consecutively: [!-#][!-#]+. If this matches, the username is invalid.
This satisfies your criteria exactly as written.
However, depending on how the usernames have been written, perfectly valid names like "Éponine" may still be rejected by this approach. This is because the "É" could be written either as U+00C9 LATIN CAPITAL LETTER E WITH ACUTE (which is matched by \p{L}) or as E followed by U+0301 COMBINING ACUTE ACCENT (which is a mark, not a letter, and so is not matched by \p{L}).
Regular-Expressions.info says it better:
Again, "character" really means "Unicode code point". \p{L} matches a
single code point in the category "letter". If your input string is à
encoded as U+0061 U+0300, it matches a without the accent. If the
input is à encoded as U+00E0, it matches à with the accent. The reason
is that both the code points U+0061 (a) and U+00E0 (à) are in the
category "letter", while U+0300 is in the category "mark".
Unicode is hairy, and restricting the characters in usernames is not necessarily a good idea anyway. Are you sure you want to do this?
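The three separate tests, plus the normalization issue described above, can be sketched in Python. This is only an illustration under some assumptions: Python's built-in re module has no \p{L} or \p{N}, so the hypothetical helper is_letter_or_number approximates them with Unicode general categories, SYMBOLS mirrors the [!-#] range used above and can be widened, and NFC normalization is applied first so that "E" plus a combining accent becomes a single letter where a precomposed form exists.

```python
import re
import unicodedata

# Assumption: mirrors the [!-#] range from the answer; widen as needed.
SYMBOLS = '!"#'
DOUBLE_SPACE = re.compile(r"\s\s")
DOUBLE_SYMBOL = re.compile("[" + re.escape(SYMBOLS) + "]{2}")


def is_letter_or_number(ch):
    # Rough stand-in for \p{L} and \p{N}: general category starts with L or N.
    return unicodedata.category(ch)[0] in "LN"


def is_valid_username(name):
    # Compose base letter + combining mark pairs into single code points.
    name = unicodedata.normalize("NFC", name)
    if not name:
        return False
    # Test 1: only letters, numbers, the allowed symbols, and spaces.
    if not all(is_letter_or_number(c) or c in SYMBOLS or c == " " for c in name):
        return False
    # Test 2: no consecutive whitespace.
    if DOUBLE_SPACE.search(name):
        return False
    # Test 3: no consecutive symbols.
    return DOUBLE_SYMBOL.search(name) is None
```

Like the three tests it follows, this sketch does not reject a space directly followed by a symbol. Marks that have no precomposed form survive normalization and are still rejected.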

The expression
^(\w| (?! )|["!@#$%^&*.:;<>?/\|{}\[\]_+=\-")](?!["!@#$%^&*.:;<>?/\|{}\[\]_+=\-")]))*$
will mostly do what you want, if your dialect supports look-ahead assertions.
See it in action at RegExr.
Please ask yourself why you want to limit usernames in this way. Most of the time a username starting with "!!" should not be an issue, and you annoy users if you reject their desired username.
Edit: \w does not necessarily match non-Latin characters. To handle them, replace \w with \p{L}, which may or may not work depending on your regex implementation. Regexr unfortunately does not support it.

Try this:
^[!@#$%^&*.:;<>?\/\|{}\[\]_+= -]?([\p{L}\d]+[!@#$%^&*.:;<>?/\|{}\[\]_+= -]?)+$
See on rubular

You want something like
string strUserName = "BillYBob Stev#nS0&";
Regex regex = new Regex(@"(?i)\b(\w+\p{P}*\p{S}*\p{Z}*\p{C}*\s?)+\b");
Match match = regex.Match(strUserName);
If you want this explained, let me know.
I hope this helps.
Note: This is case insensitive.

Since I don't know which language you need this solution in, I am providing an answer in Java. It can be translated to any other platform:
import java.util.regex.Matcher;
import java.util.regex.Pattern;
String str = "à123 àà#bcà#";
String regex = "^([\\p{L}\\d]+[!@#$%\\^&\\*.:;<>\\?/\\|{}\\[\\]_\\+=\\s-]?)+$";
Pattern p = Pattern.compile(regex);
Matcher matcher = p.matcher(str);
if (matcher.find())
    System.out.println("Matched: " + matcher.group());
One assumption I made is that the username will start with either a Unicode letter or a number.

Related

C# regex for match romanian number in text

I try to implement regex that will match romanian numbers in text.
Here is my regex:
^ | \s+[xivXIV]+\s+ | $
So it means: 'start of string or one or more whitespace characters, then one or more of xivXIV, then one or more whitespace characters or end of string.'
But it does not seem to work for me.
For example, I have the simple string 'xiv' and it is not matched by this pattern.
EDIT: The suggested post is about checking whether a whole string is a romanian number; instead I want to 'smartly' extract those literals from text, so it should handle cases like:
for 'visit' it should not take 'vi', but for 'ix table of contents' it should take 'ix'.
EDIT 2: Thanks for all the replies, the expression should be:
\b[xivXIV]+\b
NOTE: in my particular case I only need to handle XIV literals (not full romanian system); that's why I need a simpler solution.
You can use the answer from this Q&A and adapt it so that it matches substrings embedded in other text:
The accepted answer there has this:
^M{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$
Replace the start/end anchors (^ and $) by word breaks (\b):
\bM{0,4}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})\b
Note that the simpler \b[xivXIV]+\b which you mentioned in your second question-edit would accept invalid Roman numerals like:
IXI
XXXXX
and would not recognise these valid ones:
CM
LX
In a later edit of your question you wrote that you only want "to handle XIV literals (not full romanian[sic] system)". In that case you could still take the corresponding part of the above-mentioned regular expression to exclude the invalid combinations of those three letters:
\bX{0,3}(IX|IV|V?I{0,3})\b
NB: for case-insensitivity you would add the i modifier.
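As a rough illustration, here is how the X/V/I-only pattern could be applied in Python (the question is about C#, so this is a sketch of the idea rather than the asker's code; find_roman_numerals is a hypothetical helper name). One thing worth knowing: every part of the pattern is optional, so it also matches the empty string at word boundaries, and those empty matches need to be filtered out.

```python
import re

# The X/V/I-only pattern from the answer, with the "i" (ignore case) modifier.
ROMAN_XVI = re.compile(r"\bX{0,3}(?:IX|IV|V?I{0,3})\b", re.IGNORECASE)


def find_roman_numerals(text):
    # Keep only non-empty matches; the pattern can match "" at a word boundary.
    return [m.group(0) for m in ROMAN_XVI.finditer(text) if m.group(0)]
```

Note that a lone English "I" would also be reported as a numeral, so some extra context check may be needed for ordinary prose.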

Regular Expression to exclude numerical emailids

I have the below set of sample email IDs:
EmailAddress
1123
123.123
123_123
123@123.123
123@123.com
123@abc.com
123mbc@abc.com
123mbc@123abc.com
123mbc@123abc.123com
123mbc123@cc123abc.c123com
I need to eliminate email IDs whose part before the @ consists entirely of digits.
Expected output:
123mbc@abc.com
123mbc@123abc.com
123mbc@123abc.123com
123mbc123@cc123abc.c123com
I used the Java regex below, but it eliminates everything. I have only basic knowledge of writing these expressions. Please help me correct it. Thanks in advance.
[^0-9]*@.*
Do you mean something like this? (.*[a-zA-Z].*[@]\w*\.\w*)
Breakdown:
.* = 0 or more characters
[a-zA-Z] = one letter
.* = 0 or more characters
[@] = a literal @
\w*\.\w* = any number of word characters (a-zA-Z0-9 and _) with a single . in between
This way you get the emails that contain at least one letter before the @.
See the test at https://regex101.com/r/qV1bU4/3
Edited as suggested by ccf, with updated breakdown.
The following regex only lets email addresses pass that meet your specs:
(?m)^.*[^0-9@\r\n].*@
Observe that you have to specify multi-line matching (the m flag). The solution employs the embedded flag syntax, (?m). You can also call Pattern.compile with the Pattern.MULTILINE argument.
Live demo at regex101.
Explanation
Strategy: Define a basically sound email address as a single-line string containing an @, and exclude strictly numerical prefixes.
^: start-of-line anchor
@: a basically sound email address must contain the at-sign
[^...]: before the at-sign, at least one character must be neither a digit nor a CR/LF. @ is also excluded, because the non-digit character tested for must not be the first at-sign itself.
.*: before and after the non-digit tested for, arbitrary strings are permitted (well, actually they aren't, but true syntactic validation of the email address should probably not happen here and should definitely not be regex based, for reasons of reliability and code maintainability). The strings need to be represented in the pattern because the pattern is anchored.
Try this one:
[^\d\s].*@.+
It will match emails that have at least one letter or symbol before the @ sign.
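To check the idea against the sample data, here is a small Python sketch (the question uses Java, so treat this as an illustration only). It writes the separator as the at-sign and applies the "at least one non-digit before the at-sign" pattern to each address on its own, so the multi-line flag is not needed; keep_non_numeric_prefix is a hypothetical helper name.

```python
import re

# At least one character that is neither a digit nor "@" must appear
# somewhere before an "@". Each address is tested separately, so no (?m).
HAS_NON_DIGIT_BEFORE_AT = re.compile(r"^.*[^0-9@\r\n].*@")

SAMPLES = [
    "1123",
    "123.123",
    "123_123",
    "123@123.123",
    "123@123.com",
    "123@abc.com",
    "123mbc@abc.com",
    "123mbc@123abc.com",
    "123mbc@123abc.123com",
    "123mbc123@cc123abc.c123com",
]


def keep_non_numeric_prefix(addresses):
    # Keep only the addresses whose part before "@" is not purely numeric.
    return [a for a in addresses if HAS_NON_DIGIT_BEFORE_AT.search(a)]
```

Calling keep_non_numeric_prefix(SAMPLES) keeps the four addresses listed as the expected output in the question.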

Simple regex - finding words including numbers but only on occasion

I'm really bad at regex. I have:
/(@[A-Za-z-]+)/
which finds words after the @ symbol in a textbox. However, I need it to ignore email addresses, like:
foo@things.com
but it finds @things.
I also need it to include numbers, like:
@He2foo
but it only finds the @He part.
Help is appreciated, and if you feel like explaining regex in simple terms, that'd be great :D
/(?:^|(?<=\s))@([A-Za-z0-9]+)(?=[.?]?\s)/
@This (matched) regex ignores@this but matches on @separate tokens as well as tokens at the end of a sentence like @this. or @this? (without picking up the . or the ?). And yes, email@addresses.com is ignored too.
While matching on @, the regex also lets you quickly access what comes after it (like userid in @userid) through capture group 1. Check the PHP documentation on how to work with regex groups.
You can just add 0-9 to your regex, like so:
/(@[A-Za-z0-9-]+)/
I don't think any more explanation is needed since you've been able to come this far by yourself. 0-9 is just like a-z (though numeric, of course).
In order to ignore email addresses you will need to provide more specific requirements. You could try preceding @ with (^| ), which states that your value MUST be preceded by either the start of the string or a space.
Extending this, you can also use ($| ) at the end to require the value to be followed by the end of the string or a space (which means no period is allowed, and a period is required for a valid email address).
Update
$subject = "@a @b a@b a@ @b";
preg_match_all("/(^| )@[A-Za-z0-9-]+/", $subject, $matches);
print_r($matches[0]);
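For comparison, the lookbehind/lookahead approach can be tried out in Python as well. This is a sketch, assuming the marker character is @ (swap in whichever character your text uses). It differs from the first answer's pattern in one place: the lookahead also accepts the end of the text, which is my addition so that a mention in the very last position is found. find_mentions is a hypothetical helper name.

```python
import re

# Marker must be at the start of the text or directly after whitespace;
# the name may be followed by an optional "." or "?" and then whitespace
# or the end of the text ("|$" is an addition to the answer's pattern).
MENTION = re.compile(r"(?:^|(?<=\s))@([A-Za-z0-9]+)(?=[.?]?(?:\s|$))")


def find_mentions(text):
    # With a single capture group, findall returns just the captured names.
    return MENTION.findall(text)
```

The capture group holds only the name, so the marker and any trailing punctuation are left out of the result.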

HTML5 - input, using placeholder attribute for an "example", possible to use pattern attribute to ensure that "example" is not submitted?

I am creating a registration form.
I want to use the placeholder attribute on a password input to explain, in part, what type of regex is used for validation, using the pattern attribute.
This is the regex I found at www.html5pattern.com:
(?=^.{6,}$)((?=.*\d)|(?=.*\W+))(?![.\n])(?=.*[A-Z])(?=.*[a-z]).*$
The explanation for this regex was as follows:
Password (UpperCase, LowerCase, Number/SpecialChar and min 6 Chars)
The example I have used in the placeholder attribute, along with the title attribute, is this: Examp1e.
I would like to ensure that a user does not specifically enter "Examp1e" as their password.
Does anyone have any advice, suggestions, or input as to how I should go about this task?
That regex you started with is bad. For one thing, JavaScript's \W is equivalent to [^A-Za-z0-9_]: any character that isn't an ASCII word character. That includes all of the ASCII punctuation, whitespace and control characters, plus all non-ASCII characters. There's no official definition for "special characters" that I know of, but I'm pretty sure this is not what the author meant.
To move this along, I'll assume only ASCII characters are allowed in the password, and that "special characters" refers to punctuation characters:
[!"#$%&'()*+,\-./:;<=>?@\[\\\]^_`{|}~]
That would make the regex
^(?=[!-~]{6,20}$)(?=.*[A-Z])(?=.*[a-z])(?=.*[\d!"#$%&'()*+,\-./:;<=>?@\[\\\]^_`{|}~]).*$
Notes:
Notice how I pulled the ^ out of the first group; it's there to anchor the whole regex, not just that one lookahead.
I also merged the "digits" and "specials" lookaheads into one. It's not a big deal in this case, but one of my rules of thumb is that you should never use an alternation if a character class will do the job.
[!-~] is an old Perl idiom for any "visible" ASCII character (i.e., anything but whitespace or control characters).
I haven't the slightest idea what the original author was trying to do with that (?![.\n]).
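The lookahead-based rules above can be exercised outside the browser with a short Python sketch. Two things here are assumptions of mine rather than part of the answer: the punctuation class is built from string.punctuation (the 32 ASCII punctuation characters), and a negative lookahead is prepended to reject the literal placeholder value "Examp1e", which is one possible way to address the original question. is_acceptable is a hypothetical helper name.

```python
import re
import string

PLACEHOLDER = "Examp1e"  # the example shown in the form's placeholder
SPECIALS = re.escape(string.punctuation)  # ASCII punctuation, escaped

PASSWORD = re.compile("".join([
    r"(?!" + re.escape(PLACEHOLDER) + r"$)",  # assumption: reject the placeholder itself
    r"(?=[!-~]{6,20}$)",                      # 6 to 20 visible ASCII characters
    r"(?=.*[A-Z])",                           # at least one uppercase letter
    r"(?=.*[a-z])",                           # at least one lowercase letter
    r"(?=.*[\d" + SPECIALS + r"])",           # at least one digit or punctuation mark
    r".*",
]))


def is_acceptable(password):
    # fullmatch anchors the whole string, like the ^...$ in the answer.
    return PASSWORD.fullmatch(password) is not None
```

In principle the same negative lookahead could be placed at the front of the pattern attribute value, but any client-side check should be repeated on the server.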
This regex works very well for me when validating an input via the HTML5 pattern attribute, as I have not been able to get an invalid result from a valid input:
(?=^.{6,20}$)((?=.*\d)|(?=.*\W+))(?![.\n])(?=.*[A-Z])(?=.*[a-z]).*$

Does \w match all alphanumeric characters defined in the Unicode standard?

Does Perl's \w match all alphanumeric characters defined in the Unicode standard?
For example, will \w match all (say) Chinese and Russian alphanumeric characters?
I wrote a simple test script (see below) which suggests that \w does indeed match "as expected" for the non-ASCII alphanumeric characters I tested. But the testing is obviously far from exhaustive.
#!/usr/bin/perl
use utf8;
binmode(STDOUT, ':utf8');
my @ok;
$ok[0] = "abcdefghijklmnopqrstuvwxyz";
$ok[1] = "éèëáàåäöčśžłíżńęøáýąóæšćôı";
$ok[2] = "şźüęłâi̇ółńśłŕíáυσνχατςęςη";
$ok[3] = "τσιαιγολοχβςανنيرحبالтераб";
$ok[4] = "иневоаслкłјиневоцедањеволс";
$ok[5] = "рглсывызтоμςόκιναςόγο";
foreach my $ok (@ok) {
    die unless ($ok =~ /^\w+$/);
}
perldoc perlunicode says
Character classes in regular expressions match characters instead of bytes and match against the character properties specified in the Unicode properties database. \w can be used to match a Japanese ideograph, for instance.
So it looks like the answer to your question is "yes".
However, you might want to use the \p{} construct to directly access specific Unicode character properties. You can probably use \p{L} (or, shorter, \pL) for letters and \pN for numbers and feel a little more confident that you'll get exactly what you want.
Yes and no.
If you want all alphanumerics, you want [\p{Alphabetic}\p{GC=Number}]. The \w contains both more and less than that. It specifically excludes any \pN which is not \p{Nd} nor \p{Nl}, like the superscripts, subscripts, and fractions. Those are \p{GC=Other_Number}, and are not included in \w.
Unlike most regex systems, Perl complies with Requirement 1.2a, “Compatibility Properties”, from UTS #18 on Unicode Regular Expressions. So, assuming you have Unicode strings, a \w in a regex matches any single code point that has any of the following four properties:
1. \p{Alphabetic}
2. \p{GC=Mark}
3. \p{GC=Connector_Punctuation}
4. \p{GC=Decimal_Number}
Number 4 above can be expressed in any of these ways, which are all considered equivalent:
\p{Digit}
\p{General_Category=Decimal_Number}
\p{GC=Decimal_Number}
\p{Decimal_Number}
\p{Nd}
\p{Numeric_Type=Decimal}
\p{Nt=De}
Note that \p{Digit} is not the same as \p{Numeric_Type=Digit}. For example, code point B2, SUPERSCRIPT TWO, has only the \p{Numeric_Type=Digit} property and not plain \p{Digit}. That is because it is considered a \p{Other_Number} or \p{No}. It does, however, have the \p{Numeric_Value=2} property as you would imagine.
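The distinction can be inspected with Python's unicodedata module. This is only a side illustration: it shows what the Unicode character database records for a code point, not how Perl's \w behaves, and describe is a hypothetical helper name.

```python
import unicodedata

SUPERSCRIPT_TWO = "\u00b2"


def describe(ch):
    # Each lookup returns None when the code point lacks that property.
    return {
        "category": unicodedata.category(ch),
        "decimal": unicodedata.decimal(ch, None),
        "digit": unicodedata.digit(ch, None),
        "numeric": unicodedata.numeric(ch, None),
    }
```

For SUPERSCRIPT TWO this reports the category "No" (Other_Number), no decimal value, a digit value of 2, and a numeric value of 2.0, whereas an ordinary "5" is "Nd" with all three values set.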
It’s really point number 1 above, \p{Alphabetic}, that gives people the most trouble. That’s because they too often mistakenly think it is somehow the same as \p{Letter} (\pL), but it is not.
Alphabetics include much more than that, all because of the \p{Other_Alphabetic} property, as this in turn includes some but not all \p{GC=Mark}, all of \p{Lowercase} (which is not the same as \p{GC=Ll} because it adds \p{Other_Lowercase}) and all of \p{Uppercase} (which is not the same as \p{GC=Lu} because it adds \p{Other_Uppercase}).
That’s how it pulls in \p{GC=Letter_Number} like Roman numerals and also all the circled letters, which are of type \p{Other_Symbol} and \p{Block=Enclosed_Alphanumerics}.
Aren’t you glad we get to use \w? :)
In particular \w also matches the underscore character.
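#!/usr/bin/perl
use strict;
use warnings;
my $name = 'Arun_Kumar';
# Anchor the match so it succeeds only if every character, including the underscore, is a word character.
print $name =~ /^\w+$/ ? "Underscore is a word character\n" : "Not every character is a word character\n";
$ underscore.pl
Underscore is a word character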