This question already has answers here:
How do you validate that a string is a valid IPv4 address in C++?
(17 answers)
Closed 6 years ago.
I'm currently trying to check whether user input is a valid URL (http(s)://www.abc.at or localhost) or a valid IP address (127.0.0.1, ...) using only standard C++ facilities such as regex. I could use the libraries ASIO (standalone), regex and arpa/inet.h. Is there a simple way to do that?
Thanks for your help!
Usually you should be wary about doing such validations yourself because they tend to be a lot more complicated than it appears on first glance. For example to validate an IPv4 you cannot just check for “4 numbers separated by dots”. You’ll also have to check things like the range of each number (0-255), special cases like 0.0.0.0, etc. Then what about IPv6? URLs/hostnames aren’t any less complex.
To answer your concrete question: No, there is no simple way to validate an IP/hostname.
Either use a dedicated library for checking or simply try to do whatever it is you want to do with the address and handle errors appropriately. You might consider doing a rough sanity check for obvious errors in the beginning, mainly to provide better error messages to the user. But even that requires a bit of thought. For example, it’s easy to forget about IPv6 and reject perfectly valid addresses.
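The "use a dedicated library" advice can be sketched as follows. This is a Python illustration rather than C++, and the helper name classify_address is made up for the example; the analogous route with the arpa/inet.h header mentioned in the question would be inet_pton. The sketch only recognises literal IP addresses, so hostnames such as localhost are deliberately reported as "not an IP literal" and would still need to be resolved or tried.

```python
import ipaddress


def classify_address(text):
    """Return 'IPv4', 'IPv6', or None if text is not a literal IP address.

    classify_address is a hypothetical helper; the parsing itself is done
    by the standard-library ipaddress module.
    """
    try:
        parsed = ipaddress.ip_address(text.strip())
    except ValueError:
        # Not an IP literal (it may still be a hostname such as localhost).
        return None
    return "IPv%d" % parsed.version
```

Letting the library raise and handling the error mirrors the second suggestion in the answer: attempt the operation and react to failure instead of predicting it with a pattern.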
Just use this:
^https?:\/\/(.+\..{2,10}|localhost|(?:\d{1,3}\.){3}\d{1,3})\/?.*?$
This will match any address which starts with https:// or http://, followed by one of the following three cases:
any characters, followed by a dot . and a TLD with length of 2 up to 10.
localhost
an IP address with 4 segments of up to 3 digits each (this does not check the validity of the IP address; it accepts 999.999.999.999).
Here is a live example.
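To see what the pattern accepts and rejects, here is a small Python sketch (the question is about C++, so treat this as an illustration of the pattern's behaviour only; looks_like_url is a name invented for the example). Note that the pattern always requires a scheme, so a bare localhost or a bare IP address from the question is rejected.

```python
import re

# The pattern from the answer, unchanged.
URL_PATTERN = re.compile(
    r"^https?:\/\/(.+\..{2,10}|localhost|(?:\d{1,3}\.){3}\d{1,3})\/?.*?$"
)


def looks_like_url(text):
    """Hypothetical helper: True if text matches the answer's pattern."""
    return URL_PATTERN.match(text) is not None
```

The first alternative (.+ followed by a dot and 2 to 10 arbitrary characters) is very permissive, so strings containing spaces also pass. That fits the "rough sanity check" role described in the other answer, not strict validation.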
I am new to regular expressions and have just started learning them. I was wondering which regular expressions programmers use most often. To put it another way: what are regular expressions most useful for, and how can they help me in my everyday tasks? I would prefer to know regular expressions useful for everyday programming, not occasionally used ones such as email address matching.
Anyone? Thanks.
Edit: Most of the answers include regular expressions to match email addresses, URLs, dates, phone numbers etc. Please note that not all programmers have to worry about these things in their everyday tasks. I would like to know some more generic uses of regular expressions, if there are any, which programmers in general may use regardless of what language or domain they are working in.
Regular expression examples for
Decimals input
Positive Integers ^\d+$
Negative Integers ^-\d+$
Integer ^-?\d+$
Positive Number ^\d*\.?\d+$
Negative Number ^-\d*\.?\d+$
Positive Number or Negative Number ^-?\d*\.?\d+$
Phone number ^\+?[\d\s]{3,}$
Phone with code ^\+?[\d\s]+\(?[\d\s]{10,}$
Year 1900-2099 ^(19|20)\d{2}$
Date (dd mm yyyy, d/m/yyyy, etc.)
^([1-9]|0[1-9]|[12][0-9]|3[01])\D([1-9]|0[1-9]|1[012])\D(19[0-9][0-9]|20[0-9][0-9])$
IP v4:
^(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])(\.(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])){3}$
Alphabetic input
Personal Name ^[\w.']{2,}(\s[\w.']{2,})+$
Username ^[\w\d_.]{4,}$
Password at least 6 symbols ^.{6,}$
Password or empty input ^.{6,}$|^$
email ^[_]*([a-z0-9]+(\.|_*)?)+@([a-z][a-z0-9-]+(\.|-*\.))+[a-z]{2,6}$
domain ^([a-z][a-z0-9-]+(\.|-*\.))+[a-z]{2,6}$
Other regular expressions
- Match no input ^$
- Match blank input ^\s*$
- Match New line [\r\n]|$
- Match white Space ^\s+$
- Match Url ^http\:\/\/[a-zA-Z0-9.-]+\.[a-zA-Z]{2,3}$
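A few of the listed patterns can be exercised with a short Python sketch. The dictionary keys and the matches helper are names chosen for this example. The IPv4 entry places the dot inside the repeated group so that each of the last three octets is preceded by a dot, which is an assumption about what the listed pattern is meant to express.

```python
import re

# One octet, 0-255, without leading zeros.
OCTET = r"(\d|[1-9]\d|1\d\d|2[0-4]\d|25[0-5])"

PATTERNS = {
    "integer": r"^-?\d+$",
    "number": r"^-?\d*\.?\d+$",
    "year": r"^(19|20)\d{2}$",
    # Assumed intent: octet, then three repetitions of ".octet".
    "ipv4": r"^" + OCTET + r"(\." + OCTET + r"){3}$",
}


def matches(kind, text):
    """Hypothetical helper: test text against one of the named patterns."""
    return re.match(PATTERNS[kind], text) is not None
```

Trying a handful of accepted and rejected inputs like this is a cheap way to catch patterns that look right but are not, for example a number pattern that accepts ".5" but rejects "5.".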
I would take a different angle on this and say that it's most helpful to know when to use regular expressions and when NOT to use them.
For example, imagine this problem: "Figure out if a string ends with a whitespace character." A regular expression could be used here, but if you're using C#, this code is much faster:
bool EndsWithWhitespace(string s)
{
    return !string.IsNullOrEmpty(s) && char.IsWhiteSpace(s[s.Length - 1]);
}
Regular expressions are powerful, and it's important to know when they're too powerful for the problem you're trying to solve.
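The same comparison can be sketched in Python. Both function names are invented for the example, and no timing claim is made here; the speed remark in the answer is about C#. The point is only that the direct check is short, readable, and gives the same answers as the pattern.

```python
import re


def ends_with_whitespace_regex(s):
    """Regex version: a whitespace character right before the end of the string."""
    return re.search(r"\s\Z", s) is not None


def ends_with_whitespace_direct(s):
    """Direct version: look at the last character, guarding against empty input."""
    return bool(s) and s[-1].isspace()
```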
Think about input fields that require validation, such as zip codes, telephone numbers, et cetera. Regular expressions are widely used to validate those. Also, take a look at this site, which contains many tutorials and many more examples, some of which I present next:
Numeric Ranges. Since regular expressions work with text rather than numbers, matching specific numeric ranges requires a bit of extra care.
Matching a Floating Point Number. Also illustrates the common mistake of making everything in a regular expression optional.
Matching an Email Address. There's a lot of controversy about what is a proper regex to match email addresses. It's a perfect example showing that you need to know exactly what you're trying to match (and what not), and that there's always a trade-off between regex complexity and accuracy.
Matching Valid Dates. A regular expression that matches 31-12-1999 but not 31-13-1999.
Finding or Verifying Credit Card Numbers. Validate credit card numbers entered on your order form. Find credit card numbers in documents for a security audit.
And many, many, many more possible applications.
E-mail address
Website
File-Paths
Phone-numbers/Fax/ZIP and other numbers used in business (chemistry numbers, etc.)
file content (check if the file can be a valid XML-file,...)
code modification and formatting (with replacement)
data types (GUID, parsing of integers,...)
...
Up to closing tag
([^<]*)
Seriously. I use combinations of that way too often for comfort... We should all ditch regexes for PEG parsers, especially since there's a nice regex-like grammar style for them.
Well... I kind of think your question is wrong. It sounds like you're asking about regular expressions that could/should be as much a part of one's coding, or nearly so, as things like mathematical operators. Really, if your code depends that pervasively on regular expressions, you're probably doing something very wrong. For pervasive use throughout code, you want to use data structures that are better defined and more efficient to work with than regular-expression-managed strings.
The closest thing to what you're asking for that would make much sense to me would be something like /\s+/ used for splitting strings on arbitrary amounts of whitespace.
This is a little like asking 'what are the most useful words for programmers?'
It depends what you're going to use them for, and it depends which language. And you didn't say.
Some programmers never need to worry about matching email addresses, phone numbers, ZIP codes and IP addresses.
My copy of
Mastering Regular Expressions, O'Reilly, 3rd Edition, 2006
devotes a lot of space to the flavours of regex used by different languages.
It's a great reference, but I found the 2nd edition more readable.
How can they help me in my every day tasks?
A daily use for programmers could include
search/replace of sample data for testing purposes
searching through log files for String patterns (Exceptions, for example)
searching a directory structure for files of a certain type (as simple as dir *.txt does this)
to name just a few
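The log-searching use can be sketched in a few lines of Python. The pattern, the find_exceptions helper, and the sample log lines are all made up for illustration; real logs would need a pattern adapted to their format.

```python
import re

# Assumed convention: exception class names end in "Exception" or "Error".
EXCEPTION_NAME = re.compile(r"\b(\w+(?:Exception|Error))\b")


def find_exceptions(lines):
    """Return (line_number, exception_name) for each line naming an exception."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = EXCEPTION_NAME.search(line)
        if match:
            hits.append((number, match.group(1)))
    return hits


# Hypothetical sample data.
SAMPLE_LOG = [
    "12:00:01 INFO  service started",
    "12:00:07 ERROR java.lang.NullPointerException at Foo.bar(Foo.java:42)",
    "12:00:09 WARN  retrying after TimeoutError",
]
```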
E-mail
Website URL
Phone-numbers
ZIP Code
Alphanumeric (a user name consisting of alphanumeric characters and starting with a letter)
IP Address
This will be completely dependent on what domain you work in. For some it will be phone numbers and SSN's and others it will be email addresses, IP addresses, URLs. The most important thing is knowing when you need a regex and when you don't. For example, if you're trying to parse data from an XML or HTML file, it's usually better to use a library specifically designed to parse that content than to try and write something yourself.
I was looking at email validation. I read in the RFC specs that consecutive . (dot) characters are not allowed, as in mail..me@server.com.
But are different special characters allowed to occur consecutively? Like mail.$me@server.com.
And if so, how do I make a regular expression that accepts special characters only when they occur singly or next to a different one? It shouldn't accept ones like .. && $$, but it should accept ones like &$ .$ &.
And since there's a big list of special characters allowed, I don't think a regex like \^(&&|$$|..)\ etc. is an option.
There are a few RFC compliant email validation regexes. They are not pretty; in fact they are pretty awful, spanning hundreds of characters. You really don't want to create one yourself: either use an existing one or write regular code you can understand and maintain.
This is one of the RFC compliant regexes
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
Check this link for expanded information and alternative (more practical) regexes http://www.regular-expressions.info/email.html
I finally used something like this:
/^([a-zA-Z0-9]+([\.\!\'\#\$\%\&\*\+\-\/\=\?\^\_\`\{\|\}\~]{0,1}))*[a-zA-Z0-9]+\@(([a-zA-Z0-9\-]+[\.]?[a-zA-Z0-9]+){0,2})[\.][a-zA-Z]{2,4}$/
Not pretty :)
but very much served my specifications.
Different characters like $ are allowed to occur multiple times in a row, yes. sam$$iam@example.com is a completely valid email address.
I would use a simple regex of email validation + another regex that checks double chars like /[.&$]{2}/
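The two-step idea can be sketched in Python as below. The "simple" email pattern is an assumption chosen for the example, as are all the names. One detail worth noting: [.&$]{2} matches any two of those characters in a row, including different ones such as .$, which the question wants to accept. A backreference, ([.&$])\1, matches only the same character repeated.

```python
import re

# Assumed stand-in for "a simple regex of email validation".
SIMPLE_EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

# The pattern from the answer: any two of these characters in a row.
ANY_TWO_SPECIALS = re.compile(r"[.&$]{2}")

# Variant with a backreference: the SAME character twice in a row.
SAME_SPECIAL_TWICE = re.compile(r"([.&$])\1")


def accept(address):
    """Hypothetical check matching the asker's rule, not the RFC."""
    if not SIMPLE_EMAIL.match(address):
        return False
    return SAME_SPECIAL_TWICE.search(address) is None
```

Only three special characters are listed here to keep the sketch short; the character class would need to be extended to cover the full set the asker has in mind.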
I suppose it depends on what you're doing with this email validation, but I've done this for years in online ASP.NET regex validators for form entry purposes.
For a few months I thought I had what was a pretty cool regular expression to take care of this. I found it online and it seemed to be a popular one. However, on several occasions I'd get a call from a customer trying to fill out the application where the form validation didn't like their email address. And who knows how many people had the same problem but didn't call.
I learned the lesson the hard way that it's better to err on the side of greediness than to try to be too strict. In other words, since there are soooooo many rules in defining what makes an email address valid (and invalid), I simply define a loose open-ended regex to cover all of my bases. It may match some invalid email addresses as well, but for my purposes that's not as big of a deal. Besides, quite honestly -- most of the time if the user is screwing up their email address it's going to be a misspelling which regex isn't going to catch anyways.
So here's what I use now:
^[^<>\s\@]+(\@[\w\-]+(\.[\w\-]+)+)$
And here's a working example to test this:
http://regexhero.net/tester/?id=b90d359f-0dda-4b2a-a9b7-286fc513cf40
This doesn't address your primary concern as this will still match consecutive dots, dashes, etc. And I still can't claim this will match every valid email address because I honestly don't know. But I can say that I've been using it for the past 3 years with over 25,000 users and not a single complaint.
See these answers:
stackoverflow.com/questions/997078/email-regular-expression
stackoverflow.com/questions/201323/what-is-the-best-regular-expression-for-validating-email-addresses
stackoverflow.com/questions/36261/test-expand-my-email-regex
Just remember, as stated before: the only way to tell if an email address is truly valid is to send email to it!
Please don't answer the obvious, but what are the limit signs that tell us a problem should not be solved using regular expressions?
For example: Why is a complete email validation too complex for a regular expression?
Regular expressions are a textual representation of finite-state automata. That is to say, they are limited to only non-recursive matching. This means that you can't have any concept of "scope" or "sub-match" in your regexp. Consider the following problem:
(())()
Are all the open parens matched with a close paren?
Obviously, when we look at this as human beings, we can easily see that the answer is "yes". However, no regular expression will be able to reliably answer this question. In order to do this sort of processing, you will need a full pushdown automaton (like a DFA with a stack). This is most commonly found in the guise of a parser such as those generated by ANTLR or Bison.
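For this particular problem the "stack" can be reduced to a single counter, because there is only one kind of bracket. A minimal Python sketch (the function name is invented for the example):

```python
def parens_balanced(text):
    """Return True if every '(' in text is closed by a later ')'."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                # A ')' appeared before its matching '('.
                return False
    # Anything left open means an unmatched '('.
    return depth == 0
```

The counter is exactly the unbounded memory that a finite-state automaton lacks: the nesting depth can grow without limit, so no fixed number of states can track it.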
A few things to look out for:
beginning and ending tag detection -- matched pairing
recursion
needing to go backwards (though you can reverse the string, but that's a hack)
Regexes, as much as I love them, aren't good at those three things. And remember, keep it simple! If you're trying to build a regex that does "everything", then you're probably doing it wrong.
When you need to parse an expression that's not defined by a regular language.
What it comes down to is using common sense. If what you are trying to match becomes an unmanageable, monster regular expression then you either need to break it up into small, logical sub-regular expressions or you need to start re-thinking your solution.
Take email addresses (as per your example). This simple regular expression (taken from RegEx buddy) matches 99% of all emails out there:
\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b
It is short and to the point and you will rarely run into issues with it. However, as the author of RegEx buddy points out, if your email address is in the rare top-level domain ".museum" it will not be accepted.
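The .museum limitation is easy to reproduce. The Python sketch below uses the short pattern written with @ between the local part and the domain, and compiles it case-insensitively since the pattern lists only uppercase letters; the helper name and the sample addresses are made up.

```python
import re

SHORT_EMAIL = re.compile(
    r"\b[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,4}\b", re.IGNORECASE
)


def short_pattern_accepts(address):
    """Hypothetical helper: True if the whole address matches the short pattern."""
    return SHORT_EMAIL.fullmatch(address) is not None
```

The {2,4} bound on the last label is what rejects the six-letter top-level domain; widening it trades one kind of false rejection for more false acceptances, which is the trade-off the answer describes.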
To truly match all email addresses you need to adhere to the standard known as RFC 2822. It outlines the multitude of ways email addresses can be formatted and it is extremely complex.
Here is a sample regular expression attempting to adhere to RFC 2822:
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
This obviously becomes a problem of diminishing returns. It is better to use the easily maintained implementation that matches 99% of email addresses than the monstrous one that accepts 99.9% of them.
Regular expressions are a great tool to have in your programmer's toolbox, but they aren't a solution to all your parsing problems. If you find your RegEx solution becoming extremely complex, you need to either break it up logically into smaller regular expressions that match portions of your text, or start looking at other methods to solve your problem. Similarly, there are problems that regular expressions, due to their nature, simply can't solve (as one poster said, those not defined by a regular language).
Regular expressions are suited for tokenizing, finding or identifying individual bits of text, e.g. finding keywords, strings, comments, etc. in source code.
Regular expressions are not suited for determining the relationship between multiple bits of text, e.g. finding a block of source code with properly paired braces. You need a parser for that. The parser can use regular expressions for tokenizing the input, while the parser itself determines how the different regex matches fit together.
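The division of labour can be sketched in Python: a regex tokenizes the input into strings, comments, braces and everything else, and ordinary code decides whether the braces pair up. The token grammar here is a toy assumption (C-style // comments and double-quoted strings), and all names are invented for the example.

```python
import re

# Tokenizer: each alternative recognises one kind of token.
TOKEN = re.compile(
    r"""
    (?P<string>"(?:\\.|[^"\\])*")   # double-quoted string with escapes
  | (?P<comment>//[^\n]*)           # line comment
  | (?P<brace>[{}])                 # the tokens the parser cares about
  | (?P<other>[^"{}/]+|/)           # anything else
    """,
    re.VERBOSE,
)


def braces_balanced(source):
    """Parser part: track nesting depth over brace tokens only."""
    depth = 0
    for match in TOKEN.finditer(source):
        if match.lastgroup != "brace":
            continue  # braces inside strings and comments are skipped
        depth += 1 if match.group() == "{" else -1
        if depth < 0:
            return False
    return depth == 0
```

Because strings and comments are consumed as whole tokens, a brace inside "}" or after // never reaches the counting code, which would be awkward to guarantee with a single pattern.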
Essentially, you're going too far with your regular expressions if you start thinking about "balancing groups" (.NET's capture group subtraction feature) or "recursion" (Perl 5.10 and PCRE).
Here's a good quote from Raymond Chen:
Don't make regular expressions do what they're not good at. If you want to match a simple pattern, then match a simple pattern. If you want to do math, then do math. As commenter Maurits put it, "The trick is not to spend time developing a combination hammer/screwdriver, but just use a hammer and a screwdriver."
Source
Solve the problem with a regex, then give it to somebody else conversant in regexes. If they can't tell you what it does (or at least say with confidence that they understand) in about 10 minutes, it's too complex.
A sure sign to stop using regexps is this: if you have many grouping parentheses '()' and many alternatives '|', you are trying to do (complex) parsing with regular expressions.
Add Perl extensions, backreferences, etc. to the mix and soon you have yourself a parser that is hard to read, hard to modify, and whose properties are hard to reason about (e.g. is there an input on which this parser will take exponential time?).
This is the time to stop regexing and start parsing (with a hand-made parser, parser generators or parser combinators).
Besides expressions growing huge, there are fundamental limits on the languages a regexp can handle.
For instance, you cannot write a regexp for words made of n characters a followed by n characters b, where n can be any number; more strictly, the language a^n b^n.
In many programming languages regexps are an extension of regular languages, but matching time can become extremely large and such patterns are not portable.
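The a^n b^n example shows where the boundary lies. In the Python sketch below (function name invented), the regex checks only the shape "some a's, then some b's", which is a regular language; the requirement that the two counts are equal is checked in ordinary code.

```python
import re


def is_an_bn(word):
    """Return True if word is n 'a' characters followed by n 'b' characters."""
    match = re.fullmatch(r"(a*)(b*)", word)
    if match is None:
        return False  # wrong shape, e.g. an 'a' after a 'b'
    # The counting step is the part a true regular expression cannot do.
    return len(match.group(1)) == len(match.group(2))
```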
Whenever you can't be sure it really solves the problem, for example:
HTML parsing
Email validation
Language parsers
Especially so when there already exist tools that solve the problem in a totally understandable way.
Regex can be used in the domains I mentioned, but only as a subset of the whole problem and for specific, simple cases.
This goes beyond the technical limitations of regexes (regular languages + extensions); in most cases the maintainability and readability limit is reached a lot earlier than the technical limit.
A problem is too complex for regular expressions when constraints of the problem can change after the solution is written. So, in your example, how can you be sure an email address is valid when you do not have access to the target mail system to verify that the email address is attached to a valid user? You can't.
My limit is a Regex pattern that's about 30-50 characters long (varying depending on how much is fixed text and how much is regex commands)
This may sound stupid, but I often lament not being able to do database-style queries using regular expressions. Now more than before, because I am entering those types of search strings all the time on search engines. It's very difficult, if not impossible, to search for +complex AND +"regular expression"
For example, how do I search in emacs for commands that have both Buffer and Window in their name? I need to search separately for .*Buffer.*Window and .*Window.*Buffer
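In regex engines that support lookahead, an "AND" of two conditions can be written as two lookaheads anchored at the start. The sketch below uses Python's re module, and the command names in the tests are hypothetical. Whether the editor's own regex dialect supports lookahead is a separate question, so this may not carry over to an Emacs search as is.

```python
import re

# Both lookaheads must succeed from the start of the string, in any order.
BOTH_WORDS = re.compile(r"^(?=.*Buffer)(?=.*Window)")


def mentions_both(name):
    """Hypothetical helper: True if name contains both Buffer and Window."""
    return BOTH_WORDS.search(name) is not None
```

Without lookahead, the alternative is what the post describes: two patterns (or one alternation of both orderings), or simply two substring tests in code.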
This question already has answers here:
Closed 14 years ago.
Duplicate: Using a regular expression to validate an email address
There seem to be an awful lot of different variants of this on the web, and I was wondering if there is a definitive answer.
Preferably using the .net (Regex) dialect of regular expressions.
This question has been asked and answered several times:
Using a regular expression to validate an email address
Why are people using regexp for email and other complex validation?
Regexp recognition of email address hard?
Specifically related to .NET:
Validating e-mail with regular expression VB.Net
regular-expressions says that this one matches about 99%
[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?
The definitive answer? Or the normal answer?
I ask because the formal email address specification allows all sorts of weird things (parentheses, quoted phrases, etc.) that most people don't bother to account for.
See this page for a list of both comprehensive and normal regex'es.
I don't think there's a silver bullet for email regex verification.
What people commonly do is check only for obvious mistakes, like the absence of an @ or of a dot, and then send a verification email to that address. It's the only way to be sure that the email address is actually valid.
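The "check only for obvious mistakes" step does not even need a regex. A minimal Python sketch, with an invented function name and with rules that are assumptions drawn from the description above (an @ with something before it, and a dot inside the domain):

```python
def plausible_email(address):
    """Rough sanity check only; a confirmation email does the real validation."""
    local, separator, domain = address.rpartition("@")
    if not separator or not local:
        return False  # no @ at all, or nothing in front of it
    # Require a dot somewhere inside the domain, not at either end.
    return "." in domain and not domain.startswith(".") and not domain.endswith(".")
```

Anything that passes would then be sent a verification message, as the answer suggests.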
I had the same problem some time ago. RFC 2822 defines the format, and according to this page the following one is useful; it is the one I picked: "[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+(?:[A-Z]{2}|com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum)\b"
[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?
Probably want to add A-Z next to all the lower case versions in order to allow uppercase letters as well.
I don't know if there's one definitive answer for this one, but if you put aside actually checking if the domain exists, email addresses boil down to <username>@<domain>, where <domain> contains at least one dot and two to four characters in the suffix. You can do all kinds of things to check for illegal/special characters, but the simplest one would be:
^[\w-\.]+@([\w-]+\.)+[\w-]{2,4}$