Why does URLEncodedFormat() in CFML encode valid URL characters? - coldfusion

What are the reasons behind URLEncodedFormat() escaping valid URL characters?
valid characters:
- _ . ! ~ * " ( )
The CF8 Doc said, "[URLEncodedFormat() escapes] non-alphanumeric characters with equivalent hexadecimal escape sequences." However, why escape valid URL characters?

They are valid, but it seems pretty normal to me that if you ask a programming language to URL-encode a string, it converts all non-alphanumeric characters to their hex equivalents.
ASP's Server.URLEncode() does the same, and PHP's urlencode() does too, except for -, _ and the dot. JavaScript's encodeURIComponent() also encodes most non-alphanumeric characters, leaving only - _ . ! ~ * ' ( ) unescaped.
It is a good idea anyway to encode all non-alphanumeric characters when using user input to form server requests, to prevent anything unexpected from happening.

Is the encoding of valid url characters causing an error or a problem?
One issue might be that by not doing so, if you embed a link with non-encoded characters in an email, the email software may decide to break the link into two lines.
If you use a fully encoded url though, the chances of this are greatly reduced. Just one way of seeing it though.

I could see at least in the case of " that it would be nice to have it encoded when using the URL as a link in an anchor tag.

Related

Regex to encode an URL with special characters

I need to encode a string that contains special characters such as whitespaces, ' and ".
I don't know how to create a regex.
I've tried many solutions but none of them seem to work.
My final objective is to have a string such as "black cat" encoded like this "black%20cat".
EDIT:
Guys I'm working with a specific software called "Axway policy studio" and it has a component where you put in a regex and a string, in the end you get boolean output such as true or false.
It sounds more like you're trying to encode things to make them more appropriate for a URL, which does not require you to write your own regex on most platforms.
For instance, in Python, there's urllib.parse.quote for a single string (and urllib.parse.urlencode for whole query strings). In JavaScript, there's encodeURI and encodeURIComponent.
TL;DR Look up urlencode for your platform and you'll probably find what you need. Don't bother writing regexes for it unless you really need to.
P.S. Most of the urlencoding is just replacing characters with % followed by their ASCII hex value (' ' => %20, '!' => %21, ...)
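As a minimal sketch of the library approach in Python: urllib.parse.quote percent-encodes a single string, which covers the "black cat" case from the question. The helper name encode_component is made up for this example, and passing safe="" is a choice made here so that "/" gets encoded too.

```python
from urllib.parse import quote


def encode_component(text: str) -> str:
    """Percent-encode text for use as a single URL component."""
    # safe="" also encodes "/", which quote() leaves alone by default.
    return quote(text, safe="")


print(encode_component("black cat"))      # black%20cat
print(encode_component("it's \"x\""))     # it%27s%20%22x%22
```

No regex is involved; the library already knows which characters need escaping.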

Parsing text for URLs with trailing commas

I'm looking at a JSON feed from Twitter and trying to make URLs clickable using a regular expression.
The problem is that there are URLs in the text with trailing commas. A comma can legally be part of a URL, but in this case they're just punctuation inserted by the user.
Is there any way around this? Am I missing something?
You are not missing something; there is no foolproof way of determining the "intended" URL when it is surrounded by plain text. Your best bet is to make an educated guess.
A common approach is to check if the punctuation mark(s) in question is followed by a whitespace or is the terminator of the string. If it is, do not interpret it as part of the URL; otherwise, include it.
Keep in mind this problem isn't limited to commas or a single character (consider the ellipsis, ...).
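The "followed by whitespace or end of string" heuristic could be sketched in Python roughly like this. It is only a guess-based approach, as described above; the candidate pattern https?://\S+, the punctuation set, and the function name extract_urls are assumptions made for the example.

```python
import re

# Anything from the scheme up to the next whitespace is a URL candidate.
URL_CANDIDATE = re.compile(r"https?://\S+")
# Characters treated as sentence punctuation when they end a candidate.
TRAILING_PUNCTUATION = ".,;:!?"


def extract_urls(text: str) -> list[str]:
    """Return URL guesses with trailing punctuation stripped."""
    urls = []
    for match in URL_CANDIDATE.finditer(text):
        # A candidate always ends at whitespace or at the end of the string,
        # so punctuation at its tail is assumed not to belong to the URL.
        urls.append(match.group().rstrip(TRAILING_PUNCTUATION))
    return urls


print(extract_urls("See http://example.com/a,b, and http://example.org..."))
```

Commas in the middle of a URL survive, while a trailing comma or an ellipsis is dropped. A URL that genuinely ends in punctuation will be cut short, which is the price of guessing.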
You could ignore the last character if it is punctuation (so that punctuation in the middle of a url doesn't affect it).
E.g., the regex could be something like:
`([a-z/A-Z0-9.,]*?)([.,]?)\s`
Warning: the first part of the regex doesn't cover every character that can appear in a URL, so you still need to fix that. But essentially, we have ([a-z/A-Z0-9.,]*?), which matches the main part of the URL. The * allows many characters, but we use ? so that it isn't greedy.
Then we use ([.,]?) to match a possible trailing punctuation mark, and \s to match a space or other whitespace.
The first subexpression is therefore the url, and you can turn it into a link.
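To see how the two groups split the text, here is a small Python sketch using the pattern exactly as quoted, incomplete character class included. The function name split_trailing_punctuation is invented for the example, and it assumes the URL sits at the start of the text.

```python
import re

# Pattern quoted in the answer; the character class is deliberately incomplete.
PATTERN = re.compile(r"([a-z/A-Z0-9.,]*?)([.,]?)\s")


def split_trailing_punctuation(text: str):
    """Return (url, trailing_punctuation) for a URL at the start of text, or None."""
    match = PATTERN.match(text)
    if match is None:
        return None
    return match.group(1), match.group(2)


print(split_trailing_punctuation("example.com/a,b, next"))
print(split_trailing_punctuation("example.com/page next"))
```

Because the pattern ends in \s, a URL at the very end of the string (with no whitespace after it) is not matched at all, so that case would need separate handling.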
If you have access to the internet, you could try accessing the resource to see if it returns a 404 to decide whether the trailing punctuation is part of the URL or actual punctuation.

Can anyone explain this regular expression to me in detail?

I have a RegEx here and I need to know if it will 100% omit any bad email addresses but I do not understand them fully so need to call on the community experts.
The string is as follows:
^[_a-zA-Z0-9-]+(.[_a-zA-Z0-9-]+)*#[a-zA-Z0-9-]+(.[a-zA-Z0-9-]+)*(.[a-zA-Z]{2,3})$
Thank you in advance!
Please, please, don't try to validate email addresses using regular expressions; this is a wheel that does not need re-inventing, and unless you write a horrendously hairy regular expression, you will let through invalid email addresses or reject valid ones.
There are plenty of modules on CPAN like Email::Valid which will take care of it all for you and are tried-and-tested.
Simple example:
use Email::Valid;
print (Email::Valid->address('someone#example.com') ? 'yes' : 'no');
Much simpler, and will just work.
Alternatively, using Mail::RFC822::Address:
if (Mail::RFC822::Address::valid('someone#example.com')) { ...}
For an example of how hairy a regular expression would have to be to successfully handle all RFC822-compliant addresses, take a look at this beauty.
People who try to hand-roll their own email address validation tend to end up with code that lets syntactically-invalid addresses slip through, and perhaps worse, reject perfectly valid addresses.
For example, some people use + in their address, like bob+amazon#example.com - this is known as an "address tag" or "sub-addressing". Quite a few naive attempts at validation would refuse that, and the customer will end up going elsewhere.
Also, in the past some people used to assume the TLD would always be 2 or 3 characters; when e.g. .info was launched, people with addresses at those domains would be told their perfectly-valid email address wasn't acceptable.
Finally, there are some pathological cases such as "Mickey Mouse"#example.com, bob#[1.2.3.4] which are syntactically-valid, but most people's hand-rolled validation would refuse.
^[_a-zA-Z0-9-]+(.[_a-zA-Z0-9-]+)*#[a-zA-Z0-9-]+(.[a-zA-Z0-9-]+)*(.[a-zA-Z]{2,3})$
Piece by piece
^ Start of the string
[_a-zA-Z0-9-]+ One or more characters of "_" (no quotes), a letter (a-z, A-Z), a number (0-9), or "-" (no quotes)
(.[_a-zA-Z0-9-]+)* zero or more substrings of type .something, or .123, or .a123. The substring must be formed by a . and a letter (same group of letters as before). So "." is not valid. ".a" or ".1" or ".-" is.
(up until now it will accept for example my.name12 or my.name12.surname34)
# a "#" (like max#something)
[a-zA-Z0-9-]+ One or more characters with the same pattern as before
(.[a-zA-Z0-9-]+)* Zero or more substrings of type ".something"... just as before
(.[a-zA-Z]{2,3}) A "." (dot) and 2 or 3 letters (a-z or A-Z)
$ The end of the string
So we have an email address, where you can't have something.#somethingelse.ss (no "dangling" dot before the #) or .something#somethingelse.ss (no beginning dot). The domain must start with a letter and can't have a dot just before the first level domain (.com/.uk/??), so no something#x..com. The first-level domain must have 2 or 3 letters (no numbers)
There is an error, the . (dot) must be escaped, so it should be \. . Depending on the language, the \ must be escaped in a string (so it could be \\.)
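The effect of the unescaped dot is easy to check. The sketch below, in Python, compares the pattern exactly as quoted (with its # separator) against a copy where every dot is escaped; the names UNESCAPED, ESCAPED and accepts are just for illustration.

```python
import re

# The pattern as quoted in the question, with unescaped dots.
UNESCAPED = re.compile(
    r"^[_a-zA-Z0-9-]+(.[_a-zA-Z0-9-]+)*#[a-zA-Z0-9-]+(.[a-zA-Z0-9-]+)*(.[a-zA-Z]{2,3})$"
)
# The same pattern with each dot escaped, so it only matches a literal ".".
ESCAPED = re.compile(
    r"^[_a-zA-Z0-9-]+(\.[_a-zA-Z0-9-]+)*#[a-zA-Z0-9-]+(\.[a-zA-Z0-9-]+)*(\.[a-zA-Z]{2,3})$"
)


def accepts(pattern: re.Pattern, address: str) -> bool:
    """Return True if the whole address matches the pattern."""
    return pattern.match(address) is not None


for sample in ("my.name12#example.com", "a#a#a#a#aa", "someone#example.info"):
    print(sample, accepts(UNESCAPED, sample), accepts(ESCAPED, sample))
```

With the dots escaped, the pattern stops treating an arbitrary character as a separator, but it still rejects four-letter TLDs, so escaping fixes only one of its problems.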
If I see it correctly, the following would be valid according to your regex: a#a#a#a#aa
The dot is the sign for any character!
Additionally, the following valid email address would not be accepted, although it should:
Someone%special#domain.de
Simple answer: it won't.
Apart from the fact that a bad email address isn't necessarily a wrongly formatted one (this_email_address_does_not_exist#someprovider.com is correctly formatted but is still bad), the RegEx will accept some bad addresses as well.
For example, the rightmost part ((.[a-zA-Z]{2,3})$) states that the verified string should end with a dot and then two or three letters. This will accept non-existent top-level domain names (e.g. .aa) and will block four-letter TLDs (e.g. .info).
This RegEx will accept email addresses beginning with an underscore. That is (mostly) unacceptable.
You haven't placed any minimum limit on the size of the "username" (i.e. the part before the "#" symbol). Thus, single-character usernames will pass. Combined with the previous exception, addresses of the type _#something.com might go undetected.
The . (dot) operator accepts any character. So, after the "#" part, (invalid) domains of the type ##.com etc. might go undetected.
Only top-level domains of 2 or 3 characters are accepted; longer ones are rejected.
[_a-zA-Z0-9-]
Means you only allow these characters (any alphanumeric character, '-' or '_') in your email address, but an address can also validly contain any of these characters: ! # $ % & ' * + - / = ? ^ _ ` { | } ~
The first part (before #) can be at most 64 characters long ({1,64}) and the second part (after #) at most 253 characters long ({4,253}). (Add parentheses around the first or second group before appending the count limit.)
If you want to know the email address standard, see the Wikipedia article on email addresses.
No, it will not exclude 100% of bad email addresses. Short of rejecting all addresses, this is impossible for a regex to accomplish because the vast majority of syntactically-valid addresses are for accounts which do not exist, such as shgercnhlch#stackoverflow.com.
The only way to truly verify the legitimacy of an email address is to attempt to send mail to it - and even that will only tell you that mail is accepted at that address, not that it is received by a human (as opposed to being fed to a script or silently discarded) and, even if it is received by a human, you have no guarantee that it's the human who claimed to own it. ("You insist that I have to give you a deliverable email address? Fine. My email address is president#whitehouse.gov.")
perhaps this regular expression will do?
^[_A-Za-z0-9-\+]+(\.[_A-Za-z0-9-]+)*#[A-Za-z0-9-]+(\.[A-Za-z0-9]+)*(\.[A-Za-z]{2,})$
taken from
http://www.mkyong.com/regular-expressions/how-to-validate-email-address-with-regular-expression/
To all the writers above that identify that the . accepts any character, I have found that in writing a response to another RegEx question, this edit-capture widget eats backslashes.
(IT'S A PROBLEM!)
Ok... Let's write it correctly:
^\s*([_a-zA-Z0-9]+(\.[_a-zA-Z0-9\-%]+)*)#([a-zA-Z0-9]+(\.[a-zA-Z0-9\-]+)*(\.[a-zA-Z]{2,4}))\s*$
This also incorporates the % character as an allowed-inside value. The problem with this routine is that while it actually does a pretty good job parsing email addresses, it also is not very efficient, since RegEx is "greedy" and the terminating condition (which is supposed to match things like .com and .edu) will overshoot, then need to backtrack, costing considerable CPU time.
The real answer is to use the routines that are specific to this, as other posters have recommended. But if you don't have the CPAN modules, or the target environment does not, then the RegEx hack is arguably acceptable.

What is the proper regex for Active-Directory object's names?

My application creates a SharePoint site and an Active Directory group from user input. The special characters mentioned in http://www.webmonkey.com/reference/Special_Characters become a big problem in my application: it creates the group names differently and then can't access them by the name property. I want the user input to be validated with a regular expression for these characters. I googled and found some good regex samples and testers, but they won't solve my problem. So can anybody suggest a regex for disallowing the special characters that are a problem for Active Directory object names?
P.S. Application users may enter Turkish inputs, so regex should also allow Turkish characters like 'ç', 'ş', 'ö'
You should start with something like this:
^(\p{L}|\p{N}|\p{P})+$
This will match:
\p{L}: any kind of letter from any language
\p{N}: any kind of numeric character in any script
\p{P}: any kind of punctuation character.
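Not every regex engine supports \p{...} classes; Python's built-in re module, for one, does not. A rough equivalent can be sketched with Unicode general categories instead, as below. The function name is_allowed_name is hypothetical, and this only mirrors the pattern above; it says nothing about which names Active Directory itself accepts.

```python
import unicodedata

# Major Unicode categories matching \p{L}, \p{N} and \p{P}.
ALLOWED_MAJOR_CATEGORIES = {"L", "N", "P"}


def is_allowed_name(name: str) -> bool:
    """Return True if name is non-empty and made only of letters, numbers and punctuation."""
    # unicodedata.category() returns codes such as "Ll" or "Nd";
    # the first letter is the major category.
    return bool(name) and all(
        unicodedata.category(ch)[0] in ALLOWED_MAJOR_CATEGORIES for ch in name
    )


print(is_allowed_name("Satış-2024"))  # Turkish letters, a hyphen and digits
print(is_allowed_name("a b"))         # a space is a separator, not L/N/P
```

Note that *, ( ) and \ count as punctuation, so they pass this check and still have to be escaped when they end up in a query filter.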
When you query your AD, you must escape some special characters, as described here: Creating a Query Filter
If any of the following special characters must appear in the query filter as literals, they must be replaced by the listed escape sequence.
ASCII character    Escape sequence substitute
*                  "\2a"
(                  "\28"
)                  "\29"
\                  "\5c"
NUL                "\00"
In addition, arbitrary binary data may be represented using the escape sequence syntax by encoding each byte of binary data with the backslash followed by two hexadecimal digits. For example, the four-byte value 0x00000004 is encoded as "\00\00\00\04" in a filter string.
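The substitutions listed above could be applied with a small helper along these lines. This is only a sketch in Python; the name escape_filter_value is made up, and it handles just the five characters from the table, not the arbitrary binary data case.

```python
# Escape sequences taken from the list above.
_FILTER_ESCAPES = {
    "\\": r"\5c",
    "*": r"\2a",
    "(": r"\28",
    ")": r"\29",
    "\x00": r"\00",
}


def escape_filter_value(value: str) -> str:
    """Replace the special characters in a filter value with their escape sequences."""
    # Working character by character means the backslashes introduced by
    # the escapes themselves are never escaped a second time.
    return "".join(_FILTER_ESCAPES.get(ch, ch) for ch in value)


print(escape_filter_value("Sales (EMEA)*"))  # Sales \28EMEA\29\2a
```

Letters outside ASCII, such as Turkish characters, pass through untouched.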

Regex match, quite simple:

I'm looking to match Twitter syntax with a regex.
How can I match anything that is "#______" that is, begins with an # symbol, and is followed by no spaces, just letters and numbers until the end of the word? (To tweeters, I want to match someone's name in a reply)
Go for
/#(\w+)/
to get the matching name extracted as well.
#\w+
That simple?
It should be noted that Twitter no longer allows usernames longer than 15 characters, so you can also match with:
#\w{1,15}
There are still apparently a few people with usernames longer than 15 characters, but testing on 15 would be better if you want to exclude likely false positives.
There are apparently no rules regarding whether underscores can be used at the beginning or end of usernames, multiple underscores, etc., and there are accounts with single-letter names, as well as someone with the username "_".
#[\d\w]+
\d for a digit character
\w for a word character
[] to denote a character class
+ to represent one or more instances of the character class
Note that these specifiers for word and digit characters are language dependent. Check the language specification to be sure.
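For what it's worth, here is how the length-limited pattern could look in Python, keeping the same # sigil as the patterns above. Two things are additions for this sketch rather than part of the answers: the (?!\w) lookahead, which rejects over-long names instead of truncating them to 15 characters, and re.ASCII, which pins \w to [A-Za-z0-9_] because Python's \w otherwise matches Unicode letters too. The name find_names is hypothetical.

```python
import re

# "#" followed by 1 to 15 word characters, captured without the sigil.
# (?!\w) makes a run longer than 15 characters fail rather than be cut short.
MENTION = re.compile(r"#(\w{1,15})(?!\w)", re.ASCII)


def find_names(text: str) -> list[str]:
    """Return every name that follows the sigil in text."""
    return MENTION.findall(text)


print(find_names("#alice hi #bob_2, and #_"))  # ['alice', 'bob_2', '_']
```

If you would rather keep the first 15 characters of an over-long name, drop the lookahead.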
There is a very extensive API for extracting valid Twitter names, mentions, etc. The Java version of the API provided by Twitter can be found on GitHub as twitter-text-java. You may want to take a look at it to see if this is something you can use.
I have used it to validate Twitter names and it works very well.