I found the regex below that I'm using to validate password complexity. How can I modify it to include these characters -_+=#^~ ?
current regex
^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[$#$!%*?&])[A-Za-z\d$#$!%*?&]{8,}
conditions
Minimum eight characters, at least one uppercase letter, one lowercase letter, one number and one special character
You can include those special characters in the character classes:
^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)(?=.*[-_+=#^~$#$!%*?&])[\w+=^$#$!%*?&~-]{8,}$
RegEx Demo
Just remember to keep an unescaped hyphen at either the start or the end of the character class, and keep ^ away from the first position so it is not interpreted as negation.
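A quick way to sanity-check an extended pattern like this is to run a few candidate passwords through it. The sketch below uses Python's re module. The names SPECIALS, PASSWORD_RE and is_valid are made up for illustration, and the pattern is a de-duplicated variant of the one above rather than a verbatim copy.

```python
import re

# The hyphen sits at the end of the class and ^ is never first,
# so both are treated as literal characters.
SPECIALS = r"_+=#^~$!%*?&-"

PASSWORD_RE = re.compile(
    r"^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)"      # lowercase, uppercase, digit
    r"(?=.*[" + SPECIALS + r"])"            # at least one special character
    r"[A-Za-z\d" + SPECIALS + r"]{8,}$",    # only allowed characters, 8 or more
    re.ASCII,                               # keep \d limited to 0-9
)


def is_valid(password: str) -> bool:
    # fullmatch avoids the case where $ matches just before a trailing newline.
    return PASSWORD_RE.fullmatch(password) is not None


if __name__ == "__main__":
    for candidate in ("Passw0rd-", "Passw0rd", "passw0rd~", "Pw0-"):
        print(candidate, is_valid(candidate))
```

Building the class from one string keeps the lookahead and the final character class in sync when the list of special characters changes.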
Brief
I see these types of questions get posted here all the time, especially with the javascript tag.
The way you're validating passwords is actually very wrong. Don't limit passwords to a specific set of characters; you're making hackers' jobs extremely easy. How many permutations of the characters abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-_+=#^~$@!%*?& are there? Yes, a lot, but limiting the characters to that set reduces the number of permutations possible. Your character set includes 76 characters.
Now let's do some math. 76 characters, password of length 8 (let's be honest, even though we don't like to admit it, most users use a password that's as short as possible, so 8 characters in your case). Counting arrangements in which no character repeats, that means there are 760,269,225,744,000 possible permutations of those characters (allowing repeated characters, the figure is 76^8 = 1,113,034,787,454,976).
Great! Now what? Adding one more character to the set (77 characters instead of 76) we now get 848,416,382,352,000 permutations (+88,147,156,608,000 permutations). One more (78) yields 945,378,254,620,800 (+96,961,872,268,800 permutations) etc. As you can see, each character added to the set contributes more permutations than the one before it. (Strictly speaking, the count grows polynomially with the size of the set; it is the password length that makes it grow exponentially.)
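The figures above can be reproduced with a few lines of Python. This is only a sketch: it assumes the quoted numbers count arrangements without repeated characters (which is what math.perm computes, Python 3.8+), and it prints the count with repeats allowed for comparison. The function names are illustrative.

```python
import math

LENGTH = 8  # password length used in the discussion above


def without_repeats(set_size: int, length: int = LENGTH) -> int:
    # Ordered arrangements where each character is used at most once.
    return math.perm(set_size, length)


def with_repeats(set_size: int, length: int = LENGTH) -> int:
    # Every position may hold any character of the set.
    return set_size ** length


if __name__ == "__main__":
    for size in (76, 77, 78):
        print(size, f"{without_repeats(size):,}", f"{with_repeats(size):,}")
```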
Whilst adding additional characters to your set may not actually increase password strength (users may still use e in the password instead of è), it at least gives users the option to try to make their passwords stronger.
According to OWASP (the Open Web Application Security Project) - a worldwide not-for-profit organization focused on improving the security of software (from their article on Password Storage Cheat Sheet):
Do not limit the character set and set long max lengths for credentials
Some organizations restrict the types of special characters and the length of credentials accepted by systems because of their inability to prevent SQL Injection, Cross-site scripting, command-injection and other forms of injection attacks. These restrictions, while well-intentioned, facilitate certain simple attacks such as brute force.
Do not allow short or no-length passwords and do not apply character set, or encoding restrictions on the entry or storage of credentials. Continue applying encoding, escaping, masking, outright omission, and other best practices to eliminate injection risks.
A reasonable long password length is 160. Very long password policies can lead to DOS in certain circumstances.
An interesting read: Think you have a strong password? Hackers crack 16-character passwords in less than an HOUR.
Code
All that being said, I understand trying to help users create a strong password. For that you can use the following regex. Note that not all regex flavours support the \p{...} syntax; most languages offer some form of Unicode support, but the pattern will need to be adapted for them. Also note that this should be run server-side only, as doing so client-side exposes information about your password requirements in plain sight to any hackers (yes, it's still possible for them to figure it out by creating an account and trying easy passwords, but it still means they have to put a little bit of effort into figuring out what is and is not allowed):
^(?=.*\p{Ll})(?=.*\p{Lu})(?=.*\p{N})(?=.*[^\p{L}\p{N}\p{C}]).{8,}$
Explanation
^ Assert position at the start of the line
(?=.*\p{Ll}) Positive lookahead ensuring at least one lowercase letter (in any language/script) exists
(?=.*\p{Lu}) Positive lookahead ensuring at least one uppercase letter (in any language/script) exists
(?=.*\p{N}) Positive lookahead ensuring at least one number (in any language/script) exists
(?=.*[^\p{L}\p{N}\p{C}]) Positive lookahead ensuring at least one character that isn't a letter, number or control character (in any language/script) exists
.{8,} Match any character 8 or more times
$ Assert position at the end of the line
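Python's built-in re module is one of the flavours without \p{...} support, so there the same rules have to be expressed differently. The sketch below is one possible adaptation using the standard unicodedata module. The function name meets_policy is made up, and unlike the regex (where . does not match a line break) it makes no attempt to reject line breaks.

```python
import unicodedata


def meets_policy(password: str, min_length: int = 8) -> bool:
    # Two-letter Unicode general category for every character, e.g. "Lu", "Nd".
    categories = [unicodedata.category(ch) for ch in password]

    has_lower = "Ll" in categories                           # \p{Ll}
    has_upper = "Lu" in categories                           # \p{Lu}
    has_number = any(c.startswith("N") for c in categories)  # \p{N}
    # Anything that is not a letter (L*), number (N*) or "other" (C*).
    has_special = any(c[0] not in "LNC" for c in categories)

    return (
        len(password) >= min_length
        and has_lower
        and has_upper
        and has_number
        and has_special
    )


if __name__ == "__main__":
    for candidate in ("Pässw0rd!", "Пароль1!", "Passw0rd", "Ab1!"):
        print(candidate, meets_policy(candidate))
```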
Related
My SVP wants me to update our regular expression rules on our email system to better detect US bank account numbers. The issue is that bank account numbers in the US are not standardized, they can be between 6 and 17 digits.
We currently use qualifying terms to detect specific strings that we have identified as needing to be blocked. Our current rules are variations of this:
(?i)bank\saccount\s[0-9]{6,17}
The issue that I need to solve is the need to detect the numbers even if they are prepended or followed with bank account. I know I can find a single example with this:
(?=.*?(bank\saccount))(?=.*?(\d{6,17}))
But my SVP also wants to be able to detect the number of account numbers in a particular message. I've tried adding a third capture group with a greedy quantifier so that it grabs a different number than the second:
(?=.*?(bank\saccount))(?=.*?(\d{6,17}))(?=.*(\d{6,17}))
Here is a sandbox with a couple of examples:
https://regex101.com/r/hqIEaR/3
I am new to regex, is there a way to set up this expression to return a number of matches equal to the instances of 6-17 digit numbers in a message where the string "bank account" is present?
Maybe simpler is better:
(?<=\D|^)\d{6,17}(?=\D|$)
Test here.
The idea is that you find all numbers with 6..17 digits. They are probably account numbers.
The problem is that looking for "bank account" is useless. Your statement is:
The issue that I need to solve is the need to detect the numbers even if they are not prepended by "bank account ".
So if that string may or may not be there, just ignore it completely.
How can you differentiate between an account number and a SSN? That is the topic for another question.
If "bank account" AND the numbers must be found, but with no clear relationship between them (considering their location in the text), I would actually use two searches:
a search for bank account;
If the first search succeeds, a second search for the numbers.
I expect (no proof) that it will be even faster than doing it entirely in regex, since many things will not be done at all.
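The two-search idea can be sketched in Python as follows. The function and constant names are hypothetical, and because Python's re module requires fixed-width lookbehinds, the sketch uses the negative lookarounds (?<!\d) and (?!\d) in place of (?<=\D|^) and (?=\D|$); they express the same "not adjacent to another digit" condition.

```python
import re

BANK_ACCOUNT_RE = re.compile(r"bank\saccount", re.IGNORECASE)
# The lookarounds stop a longer digit run from being cut into a "valid" piece.
NUMBER_RE = re.compile(r"(?<!\d)\d{6,17}(?!\d)")


def find_account_numbers(message: str) -> list[str]:
    # First search: bail out early if the qualifying phrase is absent.
    if not BANK_ACCOUNT_RE.search(message):
        return []
    # Second search: collect every 6 to 17 digit number in the message.
    return NUMBER_RE.findall(message)


if __name__ == "__main__":
    text = "My Bank Account is 123456, the other one is 9876543210."
    numbers = find_account_numbers(text)
    print(len(numbers), numbers)
```

The length of the returned list is the number of candidate account numbers in the message.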
Since you are using PCRE compatible engine, you may use a regex like
(?is)(?:\G(?!\A)|\A(?=.*\bbank\saccount\b)).*?\K\b\d{6,17}\b
See the regex demo.
(?is) - case insensitive and singleline/dotall modes on
(?:\G(?!\A)|\A(?=.*\bbank\saccount\b)) - either the end of the previous match or the start of the string (\A) that has a bank account whole word anywhere to the right of the current location (a (?=.*\bbank\saccount\b) positive lookahead)
.*? - any 0+ chars, as few as possible
\K - match reset operator that discards the text matched so far from the overall match memory buffer
\b\d{6,17}\b - 6 to 17 digits matched as whole words (no other letters, digits or _ chars can appear on either end).
I have tried finding a full list of patterns to use for verifying input via HTML5 form verification for various types, specifically url, email, tel and such, but I couldn't find any. Currently, the built-in versions of these input verifications are far from perfect (and tel doesn't even check if the thing you're entering is a phone number). So I was wondering, which patterns could I use for verifying the user is entering the right format in the inputs?
Here are a few examples of cases where the default verification allows input that is not supposed to be allowed:
type="email"
This field allows emails that have incorrect domains after the @, and it allows addresses to start or end with a dash or period, which isn't allowed either. So, .example-@x is allowed.
type="url"
This input basically allows any input that starts with http:// (Chrome) and is followed by anything other than a few special characters such as those that have a function in URLs (\, #, #, ~, etc). In FF, all that's checked is if it starts with http:, followed by anything other than : (even just http: is allowed in FF). IE does the same as FF, except that it doesn't disallow http::.
For example: http://. is allowed in all three. And so is http://,.
type="tel"
There currently is no built-in verification for phone numbers in any of the major browsers (it functions 100% the same as a type="text", other than telling mobile browsers which kind of keyboard to display).
So, since the browsers don't show a consistent behaviour in each of these cases, and since the behaviour they do show is very basic with many false positives, what can I do to verify my HTML forms (still using HTML5 input verification)?
PS: I'm posting this because I would find it useful to have a complete list of form verification patterns myself, so I figured it might be useful for others too (and of course others can post their solutions too).
These patterns aren't necessarily simple, but here's what I think works best in every situation. Keep in mind that (quite recently) Internationalized Domain Names (IDNs) are available too. With that, an un-testable amount of characters are allowed in URLs (there still exist lots of characters that aren't allowed in domain names, but the list of allowed characters is so big, and will change so often for different Top-Level Domains, that it's not practical to keep up with them). If you want to support the internationalized domain names, you should use the second URL pattern, otherwise, use the first.
##TL;DR:
Here's a live demo to see the following patterns in action. Scroll down for an explanation, reasoning and analysis of these patterns.
URLs
https?:\/\/(?![^\/]{253}[^\/])((?!-.*|.*-\.)([a-zA-Z0-9-]{1,63}\.)+[a-zA-Z]{2,15}|((1[0-9]{2}|[1-9]?[0-9]|2([0-4][0-9]|5[0-5]))\.){3}(1[0-9]{2}|[1-9]?[0-9]|2([0-4][0-9]|5[0-5])))(\/.*)?
https?:\/\/(?!.{253}.+$)((?!-.*|.*-\.)([^ !-,\.\/:-@\[-`{-~]{1,63}\.)+([^ !-\/:-@\[-`{-~]{2,15}|xn--[a-zA-Z0-9]{4,30})|(([01]?[0-9]{2}|2([0-4][0-9]|5[0-5])|[0-9])\.){3}([01]?[0-9]{2}|2([0-4][0-9]|5[0-5])|[0-9]))(\/.*)?
Emails
(?!(^[.-].*|[^@]*[.-]@|.*\.{2,}.*)|^.{254}.)([a-zA-Z0-9!#$%&'*+\/=?^_`{|}~.-]+@)(?!-.*|.*-\.)([a-zA-Z0-9-]{1,63}\.)+[a-zA-Z]{2,15}
Phone numbers
((\+|00)?[1-9]{2}|0)[1-9]( ?[0-9]){8}
((\+|00)?[1-9]{2}|0)[1-9]([0-9]){8}
Western-style names
([A-ZΆ-ΫÀ-ÖØ-Þ][A-ZΆ-ΫÀ-ÖØ-Þa-zά-ώß-öø-ÿ]{1,19} ?){1,10}
##URLs, without IDN support
https?:\/\/(?![^\/]{253}[^\/])((?!-.*|.*-\.)([a-zA-Z0-9-]{1,63}\.)+[a-zA-Z]{2,15}|((1[0-9]{2}|[1-9]?[0-9]|2([0-4][0-9]|5[0-5]))\.){3}(1[0-9]{2}|[1-9]?[0-9]|2([0-4][0-9]|5[0-5])))(\/.*)?
Explanation:
DNSes
URLs should always start with http:// or https://, since we don't want links to other protocols.
Domain names should not start or end with -
Domain names can be a maximum of 63 characters each (so a maximum of 63 characters between each dot), and the total length (including dots) cannot exceed 253 (or 255? be safe and bet on 253.) characters [1].
Non-IDNs can only support the letters of the Latin alphabet, the numbers 0 through 9, and a dash.
Top-level domains of non-IDNs contain only letters of the Latin alphabet, and at least two of them [2].
I've set an arbitrary limit of 15 letters, since there are currently no domains that exceed 13 characters (".international"), which most likely won't change any time soon.
IPs
Special cases such as 0.0.0.0, 127.0.0.1, etc. are not checked for
IPs that have padded zeroes in them are not allowed (for example 01.1.1.1) [4].
IP numbers can only go from 0 through 255. 256 is not allowed.
Note that the default http:.* pattern built into modern browsers will always be enforced, so even if you remove the https?:// at the start in this pattern, it will still be enforced. Use type="text" to avoid it.
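The IP rules listed above (values 0 through 255, no padded zeroes) can be checked in isolation. The sketch below lifts the octet sub-pattern out of the URL pattern and tests it with Python's re module; OCTET, IPV4_RE and is_ipv4 are illustrative names, and the capturing groups were turned into non-capturing ones purely for tidiness.

```python
import re

# One octet: 100-199, or 0-99 without a leading zero, or 200-255.
OCTET = r"(?:1[0-9]{2}|[1-9]?[0-9]|2(?:[0-4][0-9]|5[0-5]))"
IPV4_RE = re.compile(r"(?:" + OCTET + r"\.){3}" + OCTET)


def is_ipv4(text: str) -> bool:
    # fullmatch mirrors the implicit anchoring of the HTML pattern attribute.
    return IPV4_RE.fullmatch(text) is not None


if __name__ == "__main__":
    for candidate in ("192.168.0.1", "255.255.255.255", "256.1.1.1", "01.1.1.1"):
        print(candidate, is_ipv4(candidate))
```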
##URLs, with IDN support
https?:\/\/(?!.{253}.+$)((?!-.*|.*-\.)([^ !-,\.\/:-@\[-`{-~]{1,63}\.)+([^ !-\/:-@\[-`{-~]{2,15}|xn--[a-zA-Z0-9]{4,30})|(([01]?[0-9]{2}|2([0-4][0-9]|5[0-5])|[0-9])\.){3}([01]?[0-9]{2}|2([0-4][0-9]|5[0-5])|[0-9]))(\/.*)?
Explanation:
Since there is a huge amount of characters that are allowed in IDNs, it's not practically possible to list every possible combination in a HTML attribute (you'd get a huge pattern, so in that case it's much better to test it by some other method than regex) [5].
Disallowed characters in domain names are: !"#$%&'()*+, ./ :;<=>?@ [\]^_` {|}~ with the exception of a period as domain separator.
These are matched in the ranges [!-,] [\.\/] [:-@] [\[-`] [{-~].
All other characters are allowed in this input field
TLDs are allowed to have the same letters in them, up to an arbitrary limit of 15 characters (like with the non-IDN URLs).
Alternatively, TLDs can be of the format xn--* with * being an encoded version of the actual TLD. This encoding uses 2 Latin letters or Arabic numerals per original character, so the arbitrary limit here is doubled to 30.
##Email addresses
(?!(^[.-].*|[^@]*[.-]@|.*\.{2,}.*)|^.{254}.)([a-zA-Z0-9!#$%&'*+\/=?^_`{|}~.-]+@)(?!-.*|.*-\.)([a-zA-Z0-9-]{1,63}\.)+[a-zA-Z]{2,15}
Explanation:
No short pattern is 100% foolproof for email addresses, but this one covers nearly all of them. A 100% complete pattern does exist, but contains PCRE (PHP)-only regex lookaheads, so it won't work in HTML forms.
Email addresses can only contain letters of the Latin alphabet, the numbers 0-9, and the characters in !#$%&'*+\/=?^_`{|}~.- [6].
Accents are not universally supported [7], but if needed, post a comment, and I could perhaps write a version that meets the RFC 6530 standard.
The local part (before the @) can only be 63 characters long, and the total address can only be 254 characters long [8].
Addresses may not start or end with a - or ., and no two dots may appear consecutively [8].
The domain may not be an IP address [9].
Other than that, I only included the non-IDN part of the pattern. IDNs are allowed too though, so those will result in false negatives.
##Phone numbers
((\+|00)?[1-9]{2}|0)[1-9]( ?[0-9]){8}
((\+|00)?[1-9]{2}|0)[1-9]([0-9]){8}
Explanation:
Phone numbers must start with one of the following, where [CTRY] stands for the country code, and X stands for the first non-zero digit (such as 6 in mobile numbers),
00[CTRY]X
+[CTRY]X
0X
[CTRY]X (This is not officially correct syntax, but Chrome Autofill seems to like it for some reason.)
Spaces are allowed between the digits (see the second pattern for the space-less version), except before the non-zero X as defined above.
Phone numbers must be exactly 9 digits long, other than the part before the first non-zero X as defined above.
This regex is just for 10-digit phone numbers. Since phone number lengths may vary between countries, it's best to use a less strict version of this pattern, or modify it to work for the desired countries. So, this pattern should generally be used as a kind of template pattern.
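To see which formats the first phone pattern accepts before adapting it to a particular country, it can be exercised outside the browser. The sketch below uses Python's re module with fullmatch, which mirrors the way the HTML pattern attribute must match the whole value. The sample numbers are invented and is_phone is an illustrative name.

```python
import re

# First pattern from above (the variant that allows spaces between digits).
PHONE_RE = re.compile(r"((\+|00)?[1-9]{2}|0)[1-9]( ?[0-9]){8}")


def is_phone(text: str) -> bool:
    return PHONE_RE.fullmatch(text) is not None


if __name__ == "__main__":
    samples = (
        "+31612345678",    # +[CTRY]X form
        "0031612345678",   # 00[CTRY]X form
        "0612345678",      # 0X form
        "06 12 34 56 78",  # 0X form with spaces
        "061234567",       # one digit short
    )
    for sample in samples:
        print(repr(sample), is_phone(sample))
```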
##Extra: Western-style names
([A-ZΆ-ΫÀ-ÖØ-Þ][A-ZΆ-ΫÀ-ÖØ-Þa-zά-ώß-öø-ÿ]{1,19} ?){1,10}
Yes, I know, I'm very western-centric, but this may be useful too, since such a pattern can be difficult to write, and if you're making a site for western people, this will always work (Asian names have a representation in exactly this format too).
All names must start with an uppercase letter
Uppercase letters may occur in the middle of names (such as John McDoe)
Names must be at least 2 letters long
I've set an arbitrary maximum of 10 names (these people probably won't mind), each of which can be at most 20 letters long (the length of "Werbenjagermanjensen", who happens to be #1).
Latin and Greek letters are allowed, including all accented Latin and Greek letters (list) and Icelandic letters (ÐÞ ðþ):
A-Z matches all uppercase Latin letters: ABCDEFGHIJKLMNOPQRSTUVWXYZ
Ά-Ϋ matches all uppercase Greek letters, including the accented ones: Ά·ΈΉΊΌΎΏΐ ΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩ ΪΫ.
À-ÖØ-Þ matches all uppercase accented Latin letters, and the Ð and Þ: ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ. In between there's also the character × (between Ö and Ø), which is left out this way.
a-z matches all lowercase Latin letters: abcdefghijklmnopqrstuvwxyz
ά-ώ matches all lowercase Greek letters, including the accented ones: άέήίΰαβγδεζηθικλμνξοπρςστυφχψωϊϋόύώ
ß-öø-ÿ matches all lowercase accented Latin letters, and the ß, ð and þ: ßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ. In between there's also the character ÷ (between ö and ø), which is left out this way.
##References
[1] https://en.wikipedia.org/wiki/Domain_Name_System#Domain_name_syntax → https://www.rfc-editor.org/rfc/rfc1034#section-3.1
[2] https://en.wikipedia.org/wiki/List_of_Internet_top-level_domains / https://www.icann.org/resources/pages/tlds-2012-02-25-en
[3] https://en.wikipedia.org/wiki/Domain_name#Technical_requirements_and_process / What are the allowed characters in a subdomain?
[4] Based on the fact neither browsers nor the Windows cmd line allow the padded format.
[5] What are the allowed characters in a subdomain? → http://www.domainnameshop.com/faq.cgi?id=8&session=106ee5e67d523298
[6] https://en.wikipedia.org/wiki/Email_address#Local_part / What characters are allowed in an email address?
[7] https://en.wikipedia.org/wiki/Email_address#Internationalization
[8] https://en.wikipedia.org/wiki/Email_address#Syntax → https://www.rfc-editor.org/rfc/rfc5321#section-4.5.3.1
[9] Sending Email using IP Address instead of Domain Name
One of our clients validates email addresses in their own software prior to firing them via an API call to our system. The issue, however, is that their validation rules do not match those of our system, so they are accepting addresses which break our rules. This is causing lots of failed calls.
They are accepting stuff like "dave@-whatever.com", which goes against RFC 952/RFC 1123 rules as the domain begins with a hyphen. They have asked that we provide them with our regex list so they can update validation on their platform to match ours.
So, I need to find/build an RFC 952/RFC 1123 compliant regex. I found this in another SO thread (I'm a lurker :)); would it be suitable and prevent these illegal domains from being sent?
"^(([a-zA-Z]|[a-zA-Z][a-zA-Z0-9\-]*[a-zA-Z0-9])\.)*([A-Za-z]|[A-Za-z][A-Za-z0-9\-]*[A-Za-z0-9])$";
A domain part has a max length of 255 characters and can only consist of digits, ASCII letters and hyphens; a hyphen cannot come first.
Checking the validity of one domain component can be done using this regex, case insensitive, length notwithstanding:
[a-z0-9]+(-[a-z0-9]+)*
This is the normal* (special normal*)* pattern again, with normal being [a-z0-9] and special being -.
Then you take all this in another normal* (special normal*)* pattern as the normal part, and the special being ., and anchor it at the beginning and end:
^[a-z0-9]+(-[a-z0-9]+)*(\.[a-z0-9]+(-[a-z0-9]+)*)+$
If you cannot afford case insensitive matching, add A-Z to the character class.
But please note that it won't check for the max length of 255. It may be done using a positive lookahead, but the regex will become very complicated, and it is simpler to use a string length function ;)
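Combining the anchored pattern with a plain length check could look like the following Python sketch. The names LABELS_RE, MAX_DOMAIN_LENGTH and is_valid_domain are made up for illustration, and the 255 limit is simply the figure quoted in this answer.

```python
import re

# normal* (special normal*)* applied twice: hyphens inside labels, dots between them.
LABELS_RE = re.compile(
    r"^[a-z0-9]+(-[a-z0-9]+)*(\.[a-z0-9]+(-[a-z0-9]+)*)+$",
    re.IGNORECASE,
)
MAX_DOMAIN_LENGTH = 255  # limit quoted in the answer above


def is_valid_domain(domain: str) -> bool:
    # The length test is done with len() instead of a lookahead in the regex.
    if len(domain) > MAX_DOMAIN_LENGTH:
        return False
    return LABELS_RE.fullmatch(domain) is not None


if __name__ == "__main__":
    for candidate in ("example.com", "what-ever.co.uk", "-whatever.com", "foo-.com"):
        print(candidate, is_valid_domain(candidate))
```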
I have a RegEx here and I need to know if it will 100% omit any bad email addresses but I do not understand them fully so need to call on the community experts.
The string is as follows:
^[_a-zA-Z0-9-]+(.[_a-zA-Z0-9-]+)*@[a-zA-Z0-9-]+(.[a-zA-Z0-9-]+)*(.[a-zA-Z]{2,3})$
Thank you in advance!
Please, please, don't try to validate email addresses using regular expressions; this is a wheel that does not need re-inventing, and unless you write a horrendously hairy regular expression, you will let through invalid email addresses or reject valid ones.
There are plenty of modules on CPAN like Email::Valid which will take care of it all for you and are tried-and-tested.
Simple example:
use Email::Valid;
print (Email::Valid->address('someone@example.com') ? 'yes' : 'no');
Much simpler, and will just work.
Alternatively, using Mail::RFC822::Address:
if (Mail::RFC822::Address::valid('someone@example.com')) { ...}
For an example of how hairy a regular expression would have to be to successfully handle all RFC822-compliant addresses, take a look at this beauty.
People who try to hand-roll their own email address validation tend to end up with code that lets syntactically-invalid addresses slip through, and perhaps worse, reject perfectly valid addresses.
For example, some people use + in their address, like bob+amazon@example.com - this is known as an "address tag" or "sub-addressing". Quite a few naive attempts at validation would refuse that, and the customer will end up going elsewhere.
Also, in the past some people used to assume the TLD would always be 2 or 3 characters; when e.g. .info was launched, people with addresses at those domains would be told their perfectly-valid email address wasn't acceptable.
Finally, there are some pathological cases such as "Mickey Mouse"@example.com, bob@[1.2.3.4] which are syntactically-valid, but most people's hand-rolled validation would refuse.
^[_a-zA-Z0-9-]+(.[_a-zA-Z0-9-]+)*@[a-zA-Z0-9-]+(.[a-zA-Z0-9-]+)*(.[a-zA-Z]{2,3})$
Piece by piece
^ Start of the string
[_a-zA-Z0-9-]+ One or more characters of "_" (no quotes), a letter (a-z, A-Z), a number (0-9), or "-" (no quotes)
(.[_a-zA-Z0-9-]+)* zero or more substrings of type .something, or .123, or .a123. The substring must be formed by a . and a letter (same group of letters as before). So "." is not valid. ".a" or ".1" or ".-" is.
(up until now it will accept for example my.name12 or my.name12.surname34)
@ an "@" (like max@something)
[a-zA-Z0-9-]+ One or more characters with the same pattern as before
(.[a-zA-Z0-9-]+)* Zero or more substrings of type ".something"... just as before
(.[a-zA-Z]{2,3}) A "." (dot) and 2 or 3 letters (a-z or A-Z)
$ The end of the string
So we have an email address, where you can't have something.@somethingelse.ss (no "dangling" dot before the @) or .something@somethingelse.ss (no beginning dot). The domain must start with a letter and can't have a dot just before the first level domain (.com/.uk/??), so no something@x..com. The first-level domain must have 2 or 3 letters (no numbers)
There is an error, the . (dot) must be escaped, so it should be \. . Depending on the language, the \ must be escaped in a string (so it could be \\.)
If I see it correctly, the following would be valid according to your regex: a@a@a@a@aa
The dot is the sign for any character!
Additionally, the following valid email address would not be accepted, although it should:
Someone%special@domain.de
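The effect of the unescaped dot is easy to reproduce. The Python sketch below compares the pattern as posted with a variant whose dots are escaped, assuming the usual @ separator between local part and domain. UNESCAPED, ESCAPED and accepts are illustrative names, and the inputs are kept short because the unescaped variant can backtrack heavily on long non-matching strings.

```python
import re

# As posted: every "." matches any character.
UNESCAPED = re.compile(
    r"^[_a-zA-Z0-9-]+(.[_a-zA-Z0-9-]+)*@[a-zA-Z0-9-]+(.[a-zA-Z0-9-]+)*(.[a-zA-Z]{2,3})$"
)
# Dots escaped so they only match a literal ".".
ESCAPED = re.compile(
    r"^[_a-zA-Z0-9-]+(\.[_a-zA-Z0-9-]+)*@[a-zA-Z0-9-]+(\.[a-zA-Z0-9-]+)*(\.[a-zA-Z]{2,3})$"
)


def accepts(pattern: re.Pattern, address: str) -> bool:
    return pattern.fullmatch(address) is not None


if __name__ == "__main__":
    print(accepts(UNESCAPED, "a@a@a@a@aa"))          # any-character dot lets this through
    print(accepts(ESCAPED, "a@a@a@a@aa"))
    print(accepts(ESCAPED, "my.name12@example.com"))
    print(accepts(ESCAPED, "someone@example.info"))  # TLD limited to 2 or 3 letters
```

Even the escaped variant still rejects valid addresses, which is the broader point made in the answers here.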
Simple answer: it won't.
Next to the fact that a bad email address doesn't necessarily imply it's wrongly formatted (this_email_address_does_not_exist@someprovider.com is rightly formatted but is still bad), the RegEx will accept some bad addresses as well.
For example, the most right-hand part ((.[a-zA-Z]{2,3})$) states the verified string should end with a dot and then two or three letters. This will accept non-existing top level domain names (e.g. .aa) and will block four-letter TLD's (e.g. .info)
This RegEx will accept email addresses beginning with an underscore. That is (mostly) unacceptable.
You haven't placed any minimum limit on the size of the "username" (i.e. the part before the "@" symbol). Thus, single character usernames will bypass this. Combined with the previous exception, email-ids of type _@something.com might escape undetected.
The . (dot) operator accepts any character. So, after the "@" part, (invalid) domains of type ##.com etc might be undetected.
Domains with only 2 or 3 chars are accepted, rest are ignored.
[_a-zA-Z0-9-]
Means you only want these characters (any alphanumeric char or '-' or '_') in your email address but it can be valid with all these characters : ! # $ % & ' * + - / = ? ^ _ ` { | } ~
The first part (before @) must be 253 characters long at most ({1,253}) and the second part (after @) can be 64 characters long max ({4,64}). (Add parentheses to the first or second group before putting the ({4,64}) count limit)
If you want to know the email address standard, just look at Wikipedia: The Article On Wiki
No, it will not exclude 100% of bad email addresses. Short of rejecting all addresses, this is impossible for a regex to accomplish because the vast majority of syntactically-valid addresses are for accounts which do not exist, such as shgercnhlch@stackoverflow.com.
The only way to truly verify the legitimacy of an email address is to attempt to send mail to it - and even that will only tell you that mail is accepted at that address, not that it is received by a human (as opposed to being fed to a script or silently discarded) and, even if it is received by a human, you have no guarantee that it's the human who claimed to own it. ("You insist that I have to give you a deliverable email address? Fine. My email address is president@whitehouse.gov.")
perhaps this regular expression will do?
^[_A-Za-z0-9-\+]+(\.[_A-Za-z0-9-]+)*@[A-Za-z0-9-]+(\.[A-Za-z0-9]+)*(\.[A-Za-z]{2,})$
taken from
http://www.mkyong.com/regular-expressions/how-to-validate-email-address-with-regular-expression/
To all the writers above that identify that the . accepts any character, I have found that in writing a response to another RegEx question, this edit-capture widget eats backslashes.
(IT'S A PROBLEM!)
Ok... Let's write it correctly:
^\s*([_a-zA-Z0-9]+(\\.[_a-zA-Z0-9\\-\\%]+)\*)@([a-zA-Z0-9]+(\\.[a-zA-Z0-9\\-]+)\*(\\.[a-zA-Z]{2,4}))\s*$
This also incorporates the % character as an allowed-inside value. The problem with this routine is that while it actually does a pretty good job parsing email addresses, it also is not very efficient, since RegEx is "greedy" and the terminating condition (which is supposed to match things like .com and .edu) will overshoot, then need to backtrack, costing considerable CPU time.
The real answer is to use the routines that are specific to this, as other posters have recommended. But if you don't have the CPAN modules, or the target environment does not, then the RegEx hack is arguably acceptable.
I'm looking to match Twitter syntax with a regex.
How can I match anything that is "@______" that is, begins with an @ symbol, and is followed by no spaces, just letters and numbers until the end of the word? (To tweeters, I want to match someone's name in a reply)
Go for
/@(\w+)/
to get the matching name extracted as well.
@\w+
That simple?
It should be noted that Twitter no longer allows usernames longer than 15 characters, so you can also match with:
@\w{1,15}
There are still apparently a few people with usernames longer than 15 characters, but testing on 15 would be better if you want to exclude likely false positives.
There are apparently no rules regarding whether underscores can be used at the beginning or end of usernames, multiple underscores, etc., and there are accounts with single-letter names, as well as someone with the username "_".
@[\d\w]+
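As a rough illustration of the 15-character variant, here is a Python sketch that pulls mentions out of a piece of text, assuming mentions start with @. The lookarounds are an addition not present in the answers above: they skip names longer than 15 characters instead of truncating them, and they ignore the domain part of email addresses. MENTION_RE and extract_mentions are made-up names.

```python
import re

# re.ASCII restricts \w to [A-Za-z0-9_]; without it, Python's \w is Unicode-aware.
MENTION_RE = re.compile(r"(?<!\w)@(\w{1,15})(?!\w)", re.ASCII)


def extract_mentions(text: str) -> list[str]:
    # Returns the captured names without the leading @.
    return MENTION_RE.findall(text)


if __name__ == "__main__":
    print(extract_mentions("thanks @jack and @_ ! mail me at bob@example.com"))
```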
\d for a digit character
\w for a word character
[] to denote a character class
+ to represent one or more instances of the character class
Note that these specifiers for word and digit characters are language dependent. Check the language specification to be sure.
There is a very extensive API for how to get valid twitter names, mentions, etc. The Java version of the API provided by Twitter can be found on github twitter-text-java. You may want to take a look at it to see if this is something you can use.
I have used it to validate Twitter names and it works very well.