Edit regular expression to support + (plus) notation

I'm using a regex to 'quick and dirty' validate an email address client side, and I just found out it doesn't support the + (plus) notation (user+anything@gmail.com) that Google provides its users. I'm sure it fails in other places as well. How can I edit this to support + notation and ensure I'm dealing with an email address, while not pissing off anyone with an oddly formed email address?
var emailReg = new RegExp(/^(("[\w-\s]+")|([\w-]+(?:\.[\w-]+)*)|("[\w-\s]+")([\w-]+(?:\.[\w-]+)*))(@((?:[\w-]+\.)*\w[\w-]{0,66})\.([a-z]{2,6}(?:\.[a-z]{2})?)$)|(@\[?((25[0-5]\.|2[0-4][0-9]\.|1[0-9]{2}\.|[0-9]{1,2}\.))((25[0-5]|2[0-4][0-9]|1[0-9]{2}|[0-9]{1,2})\.){2}(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[0-9]{1,2})\]?$)/i);
Thank you,

You could use this:
^(("[\w-\s]+")|([\w-]+(?:[.+][\w-]+)*)|("[\w-\s]+")([\w-]+(?:[.+][\w-]+)*))(@((?:[\w-]+\.)*\w[\w-]{0,66})\.([a-z]{2,6}(?:\.[a-z]{2})?)$)|(@\[?((25[0-5]\.|2[0-4][0-9]\.|1[0-9]{2}\.|[0-9]{1,2}\.))((25[0-5]|2[0-4][0-9]|1[0-9]{2}|[0-9]{1,2})\.){2}(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[0-9]{1,2})\]?$)
It's still quick and dirty. It will allow your example user+anything@gmail.com, but will also allow user+anything+else@gmail.com. It won't allow user++anything@gmail.com or user.+anything@gmail.com.
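A quick way to check those claims is to run the modified pattern against the four addresses mentioned. This is only a sketch: the pattern is the one from this answer wrapped in a JavaScript literal with the question's `i` flag, and the `samples` array is made up for illustration.

```javascript
// The modified pattern from this answer, with the question's case-insensitive flag.
const emailReg = /^(("[\w-\s]+")|([\w-]+(?:[.+][\w-]+)*)|("[\w-\s]+")([\w-]+(?:[.+][\w-]+)*))(@((?:[\w-]+\.)*\w[\w-]{0,66})\.([a-z]{2,6}(?:\.[a-z]{2})?)$)|(@\[?((25[0-5]\.|2[0-4][0-9]\.|1[0-9]{2}\.|[0-9]{1,2}\.))((25[0-5]|2[0-4][0-9]|1[0-9]{2}|[0-9]{1,2})\.){2}(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[0-9]{1,2})\]?$)/i;

// Illustrative inputs: the first two should pass, the last two should fail.
const samples = [
  "user+anything@gmail.com",
  "user+anything+else@gmail.com",
  "user++anything@gmail.com",
  "user.+anything@gmail.com",
];

for (const address of samples) {
  console.log(address, emailReg.test(address));
}
```

The `[.+]` class is what does the work: a dot or a plus is accepted only as a separator between runs of word characters, which is why two separators in a row are rejected.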

I just copied your regular expression, removed an extra parenthesis, and it works fine for me:
^(("[\w-\s]+")|([\w-]+(?:.[\w-]+))|("[\w-\s]+")([\w-]+(?:.[\w-]+)))(#((?:[\w-]+.)*\w[\w-]{0,66}).([a-z]{2,6}(?:.[a-z]{2})?)$)|(#[?((25[0-5].|2[0-4][0-9].|1[0-9]{2}.|[0-9]{1,2}.)((25[0-5]|2[0-4][0-9]|1[0-9]{2}|[0-9]{1,2}).){2}(25[0-5]|2[0-4][0-9]|1[0-9]{2}|[0-9]{1,2})]?$
Live demo:
https://regex101.com/r/kX3jW0/1

Related

US State Regular expression with case sensitive

I'm working on an ASP.NET MVC application, and the model has the following regular expression to validate US states.
It works fine if the user enters all upper case, but it does not work for lower case or mixed case input.
[RegularExpression(@"^((A[ELKSZR])|(C[AOT])|(D[EC])|(F[ML])|(G[AU])|(HI)|(I[DLNA])|(K[SY])|(LA)|(M[EHDAINSOT])|(N[EVHJMYCD])|(MP)|(O[HKR])|(P[WAR])|(RI)|(S[CD])|(T[NX])|(UT)|(V[TIA])|(W[AVIY]))$", ErrorMessage = "Invalid State")]
public string State { get; set; }
I tried this one, but no luck.
// [RegularExpression(@"^(?-i:A[LKSZRAEP]|C[AOT]|D[EC]|F[LM]|G[AU]|HI|I[ADLN]|K[SY]|LA|M[ADEHINOPST]|N[CDEHJMVY]|O[HKR]|P[ARW]|RI|S[CD]|T[NX]|UT|V[AIT]|W[AIVY])$", ErrorMessage = "Invalid State")]
thank you.

Since this expression can be used for client side validation (and thus requires ECMA regex syntax, that is, JavaScript-compatible regular expression) you cannot use an inline modifier like (?i) let alone the toggled version (?i:...).
You have to double each letter with the lowercase counterpart:
^(([Aa][EeLlKkSsZzRr])|([Cc][AaOoTt])|([Dd][EeCc])|([Ff][MmLl])|([Gg][AaUu])|([Hh][Ii])|([Ii][DdLlNnAa])|([Kk][SsYy])|([Ll][Aa])|([Mm][EeHhDdAaIiNnSsOoTt])|([Nn][EeVvHhJjMmYyCcDd])|([Mm][Pp])|([Oo][HhKkRr])|([Pp][WwAaRr])|([Rr][Ii])|([Ss][CcDd])|([Tt][NnXx])|([Uu][Tt])|([Vv][TtIiAa])|([Ww][AaVvIiYy]))$
See demo
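Doubling every letter by hand is error-prone, so the expansion could be generated instead. The sketch below is an illustration only: `expandCase` is a hypothetical helper, and it assumes the pattern contains no escapes such as `\D` or `\S` (those would be mangled).

```javascript
// Hypothetical helper: rewrites each uppercase letter so it matches either case.
// Inside a character class "A" becomes "Aa"; outside one it becomes "[Aa]".
// Assumes the pattern has no backslash escapes containing uppercase letters.
function expandCase(pattern) {
  let out = "";
  let inClass = false;
  for (const ch of pattern) {
    if (ch === "[") inClass = true;
    else if (ch === "]") inClass = false;

    if (/[A-Z]/.test(ch)) {
      const both = ch + ch.toLowerCase();
      out += inClass ? both : "[" + both + "]";
    } else {
      out += ch;
    }
  }
  return out;
}

// A shortened version of the state pattern, to keep the example readable.
const upperOnly = "^((A[ELKSZR])|(C[AOT])|(HI))$";
const expanded = expandCase(upperOnly);
const anyCase = new RegExp(expanded);

console.log(expanded);
console.log(anyCase.test("hi"), anyCase.test("Ca"), anyCase.test("zz"));
```

The generated string can then be pasted into the attribute, which keeps the uppercase-only pattern as the single source of truth.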

The list above is not exhaustive: it is missing some military abbreviations. Trust me, you do not want to incur the ire of patriotic families trying to send stuff to their loved ones in the military.
Same technique - I added a few more.
^(([Aa][EeLlKkSsZzRr])|([Cc][AaOoTt])|([Dd][EeCc])|([Ff][MmLl])|([Gg][AaUu])|([Hh][Ii])|([Ii][DdLlNnAa])|([Kk][SsYy])|([Ll][Aa])|([Mm][EeHhDdAaIiNnSsOoTt])|([Nn][EeVvHhJjMmYyCcDd])|([Mm][Pp])|([Oo][HhKkRr])|([Pp][WwAaRr])|([Rr][Ii])|([Ss][CcDd])|([Tt][NnXx])|([Uu][Tt])|([Vv][TtIiAa])|([Ww][AaVvIiYy]))$

I have used
[^,]*[A-Z]{2}
hopefully, it works for you.

How to create Gmail filter searching for text only at start of subject line?

We receive regular automated build messages from Jenkins build servers at work.
It'd be nice to ferret these away into a label, skipping the inbox.
Using a filter is of course the right choice.
The desired identifier is the string [RELEASE] at the beginning of a subject line.
Attempting to specify any of the following regexes causes emails with the string release in any case anywhere in the subject line to be matched:
\[RELEASE\]*
^\[RELEASE\]
^\[RELEASE\]*
^\[RELEASE\].*
From what I've read subsequently, Gmail doesn't have standard regex support, and from experimentation it seems that, as with Google search, special characters are simply ignored.
I'm therefore looking for a search parameter which can be used, maybe something like atstart:mystring in keeping with their has:, in: notations.
Is there a way to force the match only if it occurs at the start of the line, and only in the case where square brackets are included?
Sincere thanks.

Regex is not on the list of search features, and it was (more or less, as "Better message search functionality (i.e. Wildcard and partial word search)") on the list of pre-canned feature requests, so the answer is "you cannot do this via the Gmail web UI" :-(
There are no current Labs features which offer this. SIEVE filters would be another way to do this, but they were not supported either, and there no longer seems to be any definitive statement on SIEVE support in the Gmail help.
Updated for link rot: the pre-canned list of feature requests was, er, canned. The original is on archive.org dated 2012; now you just get redirected to a dumbed-down page telling you how to give feedback. Lack of SIEVE support was covered in answer 78761, "Does Gmail support all IMAP features?". Since some time in 2015 that answer silently redirects to the answer about IMAP client configuration; archive.org has a copy dated 2014.
With the current search facility brackets of any form () {} [] are used for grouping, they have no observable effect if there's just one term within. Using (aaa|bbb) and [aaa|bbb] are equivalent and will both find words aaa or bbb. Most other punctuation characters, including \, are treated as a space or a word-separator, + - : and " do have special meaning though, see the help.
As of 2016, only the form "{term1 term2}" is documented for this, and is equivalent to the search "term1 OR term2".
You can do regex searches on your mailbox (within limits) programmatically via Google docs: http://www.labnol.org/internet/advanced-gmail-search/21623/ has source showing how it can be done (copy the document, then Tools > Script Editor to get the complete source).
You could also do this via IMAP as described here:
Python IMAP search for partial subject
and script something to move messages to a different folder. The IMAP SEARCH verb only supports substrings, not regex (Gmail search is further limited to complete words, not substrings), so the matches would need further processing to apply a regex.
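That second stage could look roughly like the sketch below. It deliberately leaves out the IMAP part: the `candidates` array is made-up data standing in for whatever subjects a substring search returned, and `refine` is a hypothetical helper name.

```javascript
// Made-up subjects standing in for the results of a substring search on "RELEASE".
const candidates = [
  "[RELEASE] build 12 ok",
  "Re: [RELEASE] build 12",
  "release notes",
];

// Anchored pattern: the literal text [RELEASE] at the very start of the subject.
const startsWithRelease = /^\[RELEASE\]/;

// Hypothetical helper: keep only the subjects that the regex accepts.
function refine(subjects, regex) {
  return subjects.filter((subject) => regex.test(subject));
}

const matches = refine(candidates, startsWithRelease);
console.log(matches);
```

Only the messages that survive this step would then be moved to the other folder.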
For completeness, one last workaround: Gmail supports plus addressing. If you can change the destination address to youraddress+jenkinsrelease@gmail.com, it will still be sent to your mailbox, where you can filter by recipient address. Make sure to filter using the full email address, to:youraddress+jenkinsrelease@gmail.com. This is of course more or less the same thing as setting up a dedicated Gmail address for this purpose :-)

Using Google Apps Script, you can use this function to filter email threads by a given regex:
function processInboxEmailSubjects() {
  // Change this to whatever regex you want; this one should cover OP's scenario.
  var regex = /^\[RELEASE\]/;
  // addLabel() expects a GmailLabel object, so look up the already existing label by name.
  var label = GmailApp.getUserLabelByName("customLabel");
  var threads = GmailApp.getInboxThreads();
  for (var i = 0; i < threads.length; i++) {
    var subject = threads[i].getFirstMessageSubject();
    if (regex.test(subject)) {
      Logger.log(subject);
      // Now do what you want to do with the email thread. For example, skip the inbox and add the label:
      threads[i].moveToArchive().addLabel(label);
    }
  }
}
As far as I know, unfortunately there isn't a way to trigger this with every new incoming email, so you have to create a time trigger like so (feel free to change it to whatever interval you think best):
function createTrigger() { // you only need to run this once, then the trigger executes the function every hour in perpetuity
  ScriptApp.newTrigger('processInboxEmailSubjects').timeBased().everyHours(1).create();
}

The only option I have found is to pick some exact wording and put that under the "Has the words" option. It's not the best option, but it works.

I was wondering how to do this myself; it seems Gmail has since silently implemented this feature. I created the following filter:
Matches: subject:([test])
Do this: Skip Inbox
And then I sent a message with the subject
[test] foo
And the message was archived! So it seems all that is necessary is to create a filter for the subject prefix you wish to handle.

preg match email and name from to

I want to extract the name and email from the following formats (if you know of any other format that mail applications use when sending emails, please mention it in a comment :)).
How can I get the name and email from strings in the following formats? (It's a single string and can be in any of these formats.)
- jon435@hotmail.com
- james jon435@hotmail.com
- "James Jordan" <jon435@hotmail.com> (gmail format)
- janne - jon44@hotmail.com (possible format)

The answer is straightforward, at least for the email portion. The rest can be special-cased away.
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
Proof I'm not insane.
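The "special-casing" for the name could be sketched as follows, for the four formats in the question. Note the assumptions: `parseSender` is a hypothetical helper, and `EMAIL` is a deliberately loose stand-in for the long pattern above, chosen only to keep the example readable.

```javascript
// Loose stand-in for the address pattern (an assumption for illustration):
// any run of characters around "@" that contains no whitespace, quotes or angle brackets.
const EMAIL = /[^\s<>"]+@[^\s<>"]+/;

// Hypothetical helper: returns { name, email }, or null when no address is found.
function parseSender(input) {
  const match = EMAIL.exec(input);
  if (!match) return null;

  // Whatever precedes the address is treated as the display name.
  const name = input
    .slice(0, match.index)
    .replace(/[<"]/g, "")    // drop quotes and the opening angle bracket
    .replace(/[\s-]+$/, "")  // drop trailing spaces and a " - " separator
    .trim();

  return { name: name || null, email: match[0] };
}

console.log(parseSender('jon435@hotmail.com'));
console.log(parseSender('james jon435@hotmail.com'));
console.log(parseSender('"James Jordan" <jon435@hotmail.com>'));
console.log(parseSender('janne - jon44@hotmail.com'));
```

Finding the address first and treating the remainder as the name avoids writing one pattern per format, at the cost of trimming a trailing hyphen from unusual names.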

If you only have those strings, it is going to require more work than a simple regular expression. For instance, your first example doesn't include the full name, it is only the e-mail, thus, you would have to use the Microsoft Live ID API to retrieve that information...and that turns out to be really hard.
What exactly are you trying to do? Perhaps there is another way?

What's the best way to validate a user-entered URL in a Cocoa application?

I am trying to build a homebrew web browser to get more proficient at Cocoa. I need a good way to validate whether the user has entered a valid URL. I have tried some regular expressions, but NSString has some interesting quirks and doesn't like some of the back-quoting that most regular expressions I've seen use.

You could start with the + (id)URLWithString:(NSString *)URLString method of NSURL, which returns nil if the string is malformed.
If you need further validation, you can use the baseURL, host, parameterString, path, etc methods to give you particular components of the URL, which you can then evaluate in whatever way you see fit.

I've found that it is possible to enter some URLs that seem to be OK but are rejected by the NSURL creation methods. So we have a method to escape the string first to make sure it's in a good format. Here is the meat of it:
NSString *escapedURLString =
    NSMakeCollectable(CFURLCreateStringByAddingPercentEscapes(NULL,
        (CFStringRef)URLString,
        (CFStringRef)@"%+#", // Characters to leave unescaped
        NULL,
        kCFStringEncodingUTF8));

Does this set of regular expressions FULLY protect against cross site scripting?

What's an example of something dangerous that would not be caught by the code below?
EDIT: After some of the comments I added another line, commented below. See Vinko's comment in David Grant's answer. So far only Vinko has answered the question, which asks for specific examples that would slip through this function. Vinko provided one, but I've edited the code to close that hole. If another of you can think of another specific example, you'll have my vote!
public static string strip_dangerous_tags(string text_with_tags)
{
    string s = Regex.Replace(text_with_tags, @"<script", "<scrSAFEipt", RegexOptions.IgnoreCase);
    s = Regex.Replace(s, @"</script", "</scrSAFEipt", RegexOptions.IgnoreCase);
    s = Regex.Replace(s, @"<object", "</objSAFEct", RegexOptions.IgnoreCase);
    s = Regex.Replace(s, @"</object", "</obSAFEct", RegexOptions.IgnoreCase);
    // ADDED AFTER THIS QUESTION WAS POSTED
    s = Regex.Replace(s, @"javascript", "javaSAFEscript", RegexOptions.IgnoreCase);
    s = Regex.Replace(s, @"onabort", "onSAFEabort", RegexOptions.IgnoreCase);
    s = Regex.Replace(s, @"onblur", "onSAFEblur", RegexOptions.IgnoreCase);
    s = Regex.Replace(s, @"onchange", "onSAFEchange", RegexOptions.IgnoreCase);
    s = Regex.Replace(s, @"onclick", "onSAFEclick", RegexOptions.IgnoreCase);
    s = Regex.Replace(s, @"ondblclick", "onSAFEdblclick", RegexOptions.IgnoreCase);
    s = Regex.Replace(s, @"onerror", "onSAFEerror", RegexOptions.IgnoreCase);
    s = Regex.Replace(s, @"onfocus", "onSAFEfocus", RegexOptions.IgnoreCase);
    s = Regex.Replace(s, @"onkeydown", "onSAFEkeydown", RegexOptions.IgnoreCase);
    s = Regex.Replace(s, @"onkeypress", "onSAFEkeypress", RegexOptions.IgnoreCase);
    s = Regex.Replace(s, @"onkeyup", "onSAFEkeyup", RegexOptions.IgnoreCase);
    s = Regex.Replace(s, @"onload", "onSAFEload", RegexOptions.IgnoreCase);
    s = Regex.Replace(s, @"onmousedown", "onSAFEmousedown", RegexOptions.IgnoreCase);
    s = Regex.Replace(s, @"onmousemove", "onSAFEmousemove", RegexOptions.IgnoreCase);
    s = Regex.Replace(s, @"onmouseout", "onSAFEmouseout", RegexOptions.IgnoreCase);
    s = Regex.Replace(s, @"onmouseup", "onSAFEmouseup", RegexOptions.IgnoreCase);
    s = Regex.Replace(s, @"onmouseup", "onSAFEmouseup", RegexOptions.IgnoreCase);
    s = Regex.Replace(s, @"onreset", "onSAFEresetK", RegexOptions.IgnoreCase);
    s = Regex.Replace(s, @"onresize", "onSAFEresize", RegexOptions.IgnoreCase);
    s = Regex.Replace(s, @"onselect", "onSAFEselect", RegexOptions.IgnoreCase);
    s = Regex.Replace(s, @"onsubmit", "onSAFEsubmit", RegexOptions.IgnoreCase);
    s = Regex.Replace(s, @"onunload", "onSAFEunload", RegexOptions.IgnoreCase);
    return s;
}

It's never enough – whitelist, don't blacklist
For example javascript: pseudo-URL can be obfuscated with HTML entities, you've forgotten about <embed> and there are dangerous CSS properties like behavior and expression in IE.
There are countless ways to evade filters and such approach is bound to fail. Even if you find and block all exploits possible today, new unsafe elements and attributes may be added in the future.
There are only two good ways to secure HTML:
- Convert it to text by replacing every < with &lt;.
  If you want to allow users to enter formatted text, you can use your own markup (e.g. markdown like SO does).
- Parse HTML into DOM, check every element and attribute and remove everything that is not whitelisted.
  You will also need to check the contents of allowed attributes like href (make sure that URLs use a safe protocol, block all unknown protocols).
  Once you've cleaned up the DOM, generate new, valid HTML from it. Never work on HTML as if it was text, because invalid markup, comments, entities, etc. can easily fool your filter.
Also make sure your page declares its encoding, because there are exploits that take advantage of browsers auto-detecting wrong encoding.
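The "safe protocol" check on href values could be sketched like this. It assumes the attribute value has already been taken from a parsed DOM (so entities are decoded); `isSafeHref`, the protocol list and the placeholder base URL are all illustrative choices, not a vetted policy.

```javascript
// Assumed allowlist; anything not listed here is treated as unsafe.
const SAFE_PROTOCOLS = new Set(["http:", "https:", "mailto:"]);

// Hypothetical helper: true only when the href resolves to an allowed protocol.
// The base URL is a placeholder so that relative links can be resolved.
function isSafeHref(href, base = "https://example.invalid/") {
  let url;
  try {
    url = new URL(href, base);
  } catch (err) {
    return false; // unparseable values are rejected rather than guessed at
  }
  return SAFE_PROTOCOLS.has(url.protocol);
}

console.log(isSafeHref("https://example.com/page"));
console.log(isSafeHref("/relative/path"));
console.log(isSafeHref("JaVaScRiPt:alert(1)"));
console.log(isSafeHref("data:text/html,hello"));
```

Letting a URL parser decide the protocol, rather than searching the string for "javascript", is the same allowlist idea applied at the attribute level.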

You're much better off turning all < into &lt; and all > into &gt;, then converting acceptable tags back. In other words, whitelist, don't blacklist.
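A minimal sketch of that escape-then-restore idea, assuming only a handful of attribute-free formatting tags are acceptable. The tag list and the helper names `escapeHtml` and `escapeThenRestore` are made up for illustration.

```javascript
// Hypothetical allowlist: bare formatting tags only, no attributes.
const ALLOWED_TAGS = ["b", "i", "em", "strong"];

// Escape everything that has a special meaning in HTML.
function escapeHtml(text) {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&#39;");
}

// Escape first, then turn back only exact, attribute-free allowed tags.
function escapeThenRestore(text) {
  const tagPattern = new RegExp(
    "&lt;(/?)(" + ALLOWED_TAGS.join("|") + ")&gt;",
    "gi"
  );
  return escapeHtml(text).replace(tagPattern, "<$1$2>");
}

console.log(escapeThenRestore("<b>hi</b><script>alert(1)</script>"));
console.log(escapeThenRestore('<b onclick="x">hi</b>'));
```

Because a tag is restored only when it matches the allowlist exactly, a tag carrying any attribute stays escaped and is shown as text.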

As David shows, there's no easy way to protect yourself with just some regexes; you can always forget something, like javascript: in your case. You'd better escape the HTML entities on output. There is a lot of discussion about the best way to do this, depending on what you actually need to allow, but what's certain is that your function is not enough.
Jeff has talked a bit about this here.

example
Any time you can write a string to the document, a big door swings open.
There are myriad places to inject malicious things into HTML/JavaScript. For this reason, Facebook didn't initially allow JavaScript in their applications platform. Their solution was to later implement a markup/script compiler that allows them to seriously filter out the bad stuff.
As said already, whitelist a few tags and attributes and strip out everything else. Don't blacklist a few known malicious attributes and allow everything else.

Although I can't provide a specific example of why not, I am going to go ahead and outright say no. This is more on principle. Regexes are an amazing tool, but they should only be used for certain problems. They are fantastic for data matching and searching.
They are not however a good tool for security. It is too easy to mess up a regex and have it be only partially correct. Hackers can find lots of wiggle room inside a poorly or even well constructed regex. I would try another avenue to prevent cross site scripting.

Take a look at the XSS cheatsheet at http://ha.ckers.org/xss.html. It's not a complete list, but it's a good start.
One that comes to mind is <img src="http://badsite.com/javascriptfile" />
You also forgot onmouseover, and the style tag.
The easiest thing to do really is entity escaping. If the vector can't render properly in the first place, an incomplete blacklist won't matter.

As an example of an attack that makes it through this:
<div style="color: expression('alert(4)')">
Shameless plug:
The Caja project defines whitelists of HTML elements and attributes so that it can control how and when scripts in HTML get executed.
See the project at http://code.google.com/p/google-caja/
and the whitelists are the JSON files in
http://code.google.com/p/google-caja/source/browse/#svn/trunk/src/com/google/caja/lang/html
and
http://code.google.com/p/google-caja/source/browse/#svn/trunk/src/com/google/caja/lang/css

I still have not figured out why developers want to massage bad input into good input with a regular expression replace. Unless your site is a blog and needs to allow embedded html or javascript or any other sort of code, reject the bad input and return an error. The old saying is Garbage In - Garbage Out, why would you want to take in a nice steaming pile of poo and make it edible?
If your site is not internationalized, why accept any unicode?
If your site only does POST, why accept any URL encoded values?
Why accept any hex? Why accept html entities? What user inputs '&#x0A' or '&ampquot;' ?
As for regular expressions, using them is fine, however, you do not have to code a separate regular expression for the full attack string. You can reject many different attack signatures with just a few well constructed regex patterns:
patterns.put("xssAttack1", Pattern.compile("<script",Pattern.CASE_INSENSITIVE) );
patterns.put("xssAttack2", Pattern.compile("SRC=",Pattern.CASE_INSENSITIVE) );
patterns.put("xssAttack3", Pattern.compile("pt:al",Pattern.CASE_INSENSITIVE) );
patterns.put("xssAttack4", Pattern.compile("xss",Pattern.CASE_INSENSITIVE) );
<FRAMESET><FRAME SRC="javascript:alert('XSS');"></FRAMESET>
<DIV STYLE="width: expression(alert('XSS'));">
<LINK REL="stylesheet" HREF="javascript:alert('XSS');">
<IMG SRC="jav ascript:alert('XSS');"> // html allows embedded tabs...
<IMG SRC="jav
ascript:alert('XSS');"> // html allows embedded newline...
<IMG SRC="jav
ascript:alert('XSS');"> // html allows embedded carriage return...
Notice that my patterns are not the full attack signature, just enough to detect if the value is malicious. It is unlikely that a user would enter 'SRC=' or 'pt:al' This allows my regex patterns to detect unknown attacks that have any of these tokens in them.
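For readers who want to try the token idea, here is a rough JavaScript transcription of the four patterns above. The `matchedSignatures` helper and the sample inputs are illustrative additions, not part of the original Java code.

```javascript
// The four token patterns from this answer, transcribed to JavaScript regexes.
const patterns = new Map([
  ["xssAttack1", /<script/i],
  ["xssAttack2", /SRC=/i],
  ["xssAttack3", /pt:al/i],
  ["xssAttack4", /xss/i],
]);

// Hypothetical helper: names of every pattern that fires on the given value.
function matchedSignatures(value) {
  return [...patterns]
    .filter(([, regex]) => regex.test(value))
    .map(([name]) => name);
}

// An attack string with an embedded tab still trips several tokens.
console.log(matchedSignatures("<IMG SRC=\"jav\tascript:alert('XSS');\">"));
// Ordinary text trips none.
console.log(matchedSignatures("Hello world"));
// Text that merely mentions the topic trips the "xss" token.
console.log(matchedSignatures("An article about XSS prevention"));
```

The last call shows the trade-off of short tokens: harmless text that happens to contain one is flagged as well.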
Many developers will tell you that you cannot protect a site with a blacklist. Since the set of attacks is infinite, that is basically true, however, if you parse the entire request (params, param values, headers, cookies) with a blacklist constructed based on tokens, you will be able to figure out what is an attack and what is valid. Remember, the attacker will most likely be shotgunning exploits at you from a tool. If you have properly hardened your server, he will not know what environment you are running and will have to blast you with lists of exploits. If he pesters you enough, put the attacker, or his IP on a quarantine list. If he has a tool with 50k exploits ready to hit your site, how long will it take him if you quarantine his id or ip for 30 min for each violation? Admittedly there is still exposure if the attacker uses a botnet to multiplex his attack. Still your site ends up being a much tougher nugget to crack.
Having checked the entire request for malicious content, you can now use whitelist-type checks (length, referential/logical, naming) to determine the validity of the request.
Don't forget to implement some sort of CSRF protection. Maybe a honey token, and check the user-agent string from previous requests to see if it has changed.

Whitespace makes you vulnerable. Read this.

Another vote for whitelisting. But it looks like you're going about this the wrong way. The way I do it, is to parse the HTML into a tag tree. If the tag you're parsing is in the whitelist, give it a tree node, and parse on. Same goes for its attributes.
Dropped attributes are just dropped. Everything else is HTML-escaped literal content.
And the bonus of this route is because you're effectively regenerating all the markup, it's all completely valid markup! (I hate it when people leave comments and they screw up the validation/design.)
Re "I can't whitelist" (para): Blacklisting is a maintenance-heavy approach. You'll have to keep an eye on new exploits and make sure you're covered. It's a miserable existence. Just do it right once and you'll never need to touch it again.

From a different point of view, what happens when someone wants to have 'javascript' or 'functionload' or 'visionblurred' in what they submit? This can happen in most places for any number of reasons... From what I understand, those will become 'javaSAFEscript', 'functionSAFEload' and 'visionSAFEblurred'(!!).
If this might apply to you, and you're stuck with the blacklist approach, be sure to use the exact matching regexes to avoid annoying the user. In other words, be at the optimum point between security and usability, compromising either as little as possible.
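One way to read "exact matching" is to add word boundaries, so a handler name is rewritten only when it stands alone. The sketch below compares both variants; the short `HANDLERS` list and the function names are made up for illustration, and this only reduces false positives, it does not make a blacklist safe.

```javascript
// Illustrative subset of the handler names from the question.
const HANDLERS = ["onload", "onblur", "onclick"];

// Substring replacement, as in the question: also rewrites innocent words.
function neutralizeSubstring(text) {
  return HANDLERS.reduce(
    (s, name) => s.replace(new RegExp(name, "gi"), "onSAFE" + name.slice(2)),
    text
  );
}

// Word-boundary replacement: only rewrites the handler name when it stands alone.
function neutralizeExact(text) {
  return HANDLERS.reduce(
    (s, name) =>
      s.replace(new RegExp("\\b" + name + "\\b", "gi"), "onSAFE" + name.slice(2)),
    text
  );
}

console.log(neutralizeSubstring("functionload"));
console.log(neutralizeExact("functionload"));
console.log(neutralizeExact('<img onload="x">'));
```

With the boundaries in place, words such as functionload and visionblurred pass through unchanged, while an actual onload attribute is still rewritten.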
