URL safe characters RegEx that will allow UTF-8 accents!

I'm looking for a RegEx pattern to use in a rereplace() function that will keep URL safe characters, but include UTF-8 characters with accents. For example: ç and ã.
Something like: url = rereplace(local.url, "pattern") etc. I prefer a ColdFusion only solution, but I'm open to using Java too since it's so easy to integrate with CF.
My URL pattern will look like: /posts/[postId]/[title-with-accents-like-ç-and-ã]

I don't know what language you are using. Perl has some utf8 matching, see for example Tatsuhiko Miyagawa's URI::Find::UTF8

This can be done by matching alphanumeric characters using \w.
rereplace(string, "[^\w]", "", "all")
See this answer for reference.
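As a rough illustration of the \w approach outside ColdFusion, here is a minimal Python sketch. It assumes Python 3, where \w in a str pattern also matches accented letters. The function name slug_safe and the decision to keep hyphens are assumptions made for this example, not part of the answer above. Whether rereplace() treats \w the same way depends on the regex engine behind it, so it is worth testing there.

```python
import re

def slug_safe(title):
    """Drop every character that is not a word character or a hyphen."""
    # In Python 3 str patterns, \w covers Unicode letters and digits plus "_",
    # so accented letters such as ç and ã are kept.
    return re.sub(r"[^\w-]", "", title)

print(slug_safe("title-with-accents-like-ç-and-ã"))
print(slug_safe("olá, mundo!"))
```

Spaces and punctuation are removed outright here; replacing them with a hyphen first may be preferable for readable titles.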

Related

A pattern to match [characters]:[characters] inside an URL

I have a URL like the one below and want to use a RegEx to extract segments like: Id:Reference, Title:dfgdfg, Status.Title:Current Status, CreationDate:Logged...
The closest pattern I got is [=,][^,]*:[^,]*[,&], but the result is not as expected. Any better ideas?
P.S. I'm using [^,] to match any character except , because , will not appear inside a segment.
This is the site I'm using for regex pattern matching:
http://regexpal.com/
The URL:
http://localhost/site/=powerManagement.power&query=_Allpowers&attributes=Id:Reference,Title:dfgdfg,Status.Title:Current Status,CreationDate:Logged,RaiseUser.Title:标题,_MinutesToBreach&sort_by=CreationDate
Thanks,

You haven't specified which programming language you use, but most Unicode-aware regex engines support this:
([\p{L}\.]+):([\p{L}\.]+)
\p{L} matches any Unicode letter, from any language, provided that your regex engine supports Unicode properties. RegEx 101.
You can extract the matches via capturing groups if you want.
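A minimal Python sketch of the capturing-group idea follows. Python's built-in re module has no \p{L}, so the sketch swaps in [^\W\d_] (word characters minus digits and underscore) as an approximation of "any Unicode letter". The names LETTER_OR_DOT, PAIR and extract_pairs are made up for this example.

```python
import re

# [^\W\d_] approximates \p{L}: word characters minus digits and underscore.
LETTER_OR_DOT = r"(?:[^\W\d_]|\.)+"
PAIR = re.compile(rf"({LETTER_OR_DOT}):({LETTER_OR_DOT})")

def extract_pairs(text):
    """Return (key, value) tuples for every letters:letters segment."""
    return PAIR.findall(text)

attributes = "Id:Reference,Title:dfgdfg,CreationDate:Logged,RaiseUser.Title:标题"
for key, value in extract_pairs(attributes):
    print(key, "=", value)
```

As with the pattern above, a value stops at the first character that is neither a letter nor a dot, so "Status.Title:Current Status" would yield only "Current" as the value.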

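In Python:
import re
# url is assumed to hold the URL string from the question
matchobj = re.match(r"^.*Id:(.*?),Title:(.*?),.*$", url)
if matchobj:  # re.match returns None when nothing matches
    Id = matchobj.group(1)
    Title = matchobj.group(2)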

RegEx to cut out URL

I'm trying to get a URL from a string of the following format:
RANDOMRUBBISHhttps://www.my-url.com/randomfirstname_randomlastnameRANDOMRUBBISH
I already tried some things, especially lookbehind/lookahead, which I had used successfully on another URL format (starts with https..., ends with .html).
But seems I'm too stupid to figure out the regex for the kind of string mentioned above. I just want the URL part from https.... to the end of the random last name. Is this even possible?
Any Ideas?

If you can guarantee that randomfirstname_randomlastname is all lowercase and RANDOMRUBBISH is all uppercase, you can use character classes [a-z] and [A-Z]. The language the regex is for will determine how to use these.
This example works in JavaScript (the underscore is included in the class because the name part contains one):
var str = "RANDOMRUBBISHhttps://www.my-url.com/randomfirstname_randomlastnameRANDOMRUBBISH";
var match = /https:\/\/www\.my-url\.com\/[a-z_]*/.exec(str);
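For comparison, the same idea as a small Python sketch. It rests on the same assumption as the answer: the name part consists only of lowercase letters and underscores, and the surrounding rubbish is uppercase. The function name extract_url is hypothetical.

```python
import re

text = "RANDOMRUBBISHhttps://www.my-url.com/randomfirstname_randomlastnameRANDOMRUBBISH"

def extract_url(s):
    """Return the URL up to the end of the lowercase name part, or None."""
    # [a-z_]* stops at the first character that is not a lowercase letter
    # or underscore, which here is the uppercase rubbish that follows.
    m = re.search(r"https://www\.my-url\.com/[a-z_]*", s)
    return m.group(0) if m else None

print(extract_url(text))
```

If the trailing rubbish can also be lowercase, no character class can tell where the name ends, and some other delimiter would be needed.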

converting javascript regex to php preg_match

I'm building a contact form using PHP with jQuery validation and I wanted them both to have the same email pattern. I looked into the Validation plugin source code and found this:
/^((([a-z]|\d|[!#\$%&'\*\+\-\/=\?\^_`{\|}~]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])+(\.([a-z]|\d|[!#\$%&'\*\+\-\/=\?\^_`{\|}~]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])+)*)|((\x22)((((\x20|\x09)*(\x0d\x0a))?(\x20|\x09)+)?(([\x01-\x08\x0b\x0c\x0e-\x1f\x7f]|\x21|[\x23-\x5b]|[\x5d-\x7e]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(\\([\x01-\x09\x0b\x0c\x0d-\x7f]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF]))))*(((\x20|\x09)*(\x0d\x0a))?(\x20|\x09)+)?(\x22)))@((([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|\d|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.)+(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])|(([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])([a-z]|\d|-|\.|_|~|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])*([a-z]|[\u00A0-\uD7FF\uF900-\uFDCF\uFDF0-\uFFEF])))\.?$/i
https://github.com/jzaefferer/jquery-validation/blob/1.8.1/jquery.validate.js#L1008
I tried plugging that into my existing php form validation, but it no longer recognizes anything. I tried various online regex test tools and some told me there was an error. Most didn't say anything more, but one said...
preg_match() [function.preg-match]: Compilation failed: PCRE does not support \L, \l, \N, \U, or \u at offset 45
http://www.solmetra.com/scripts/regex/index.php
I searched for unicode capital L, which is u004C, but I can't find any \u004C in the regex, so I don't know what is wrong or how to fix this.

If using PHP, don't use a regex, use filter_var()...
$validEmail = filter_var($email, FILTER_VALIDATE_EMAIL);

I agree with Alex - don't use a regex for this.
But for completeness' sake, this is what this (horrible) regex would look like in PHP:
/^((([a-z]|\d|[!#$%&\'*+\-\/=?\^_`{|}~]|[\x{00A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}])+(\.([a-z]|\d|[!#$%&\'*+\-\/=?\^_`{|}~]|[\x{00A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}])+)*)|((\x22)((((\x20|\x09)*(\x0d\x0a))?(\x20|\x09)+)?(([\x01-\x08\x0b\x0c\x0e-\x1f\x7f]|\x21|[\x23-\x5b]|[\x5d-\x7e]|[\x{00A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}])|(\\\\([\x01-\x09\x0b\x0c\x0d-\x7f]|[\x{00A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}]))))*(((\x20|\x09)*(\x0d\x0a))?(\x20|\x09)+)?(\x22)))@((([a-z]|\d|[\x{00A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}])|(([a-z]|\d|[\x{00A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}])([a-z]|\d|-|\.|_|~|[\x{00A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}])*([a-z]|\d|[\x{00A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}])))\.)+(([a-z]|[\x{00A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}])|(([a-z]|[\x{00A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}])([a-z]|\d|-|\.|_|~|[\x{00A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}])*([a-z]|[\x{00A0}-\x{D7FF}\x{F900}-\x{FDCF}\x{FDF0}-\x{FFEF}])))\.?$/iu
Translated by RegexBuddy.
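The error in the question comes from the \uXXXX escapes themselves: PCRE does not support \u, and expects \x{XXXX} together with the u modifier, as in the translated pattern. A minimal Python sketch of that mechanical rewrite is shown here. The function name is hypothetical, and the sketch assumes every \u in the input really is a Unicode escape; it does not handle an escaped backslash followed by a literal u, nor the extra quoting a PHP string literal needs.

```python
import re

def js_unicode_escapes_to_pcre(js_pattern):
    """Rewrite JavaScript-style Unicode escapes into PCRE's brace form."""
    return re.sub(
        r"\\u([0-9A-Fa-f]{4})",               # a backslash, "u", four hex digits
        lambda m: "\\x{" + m.group(1) + "}",  # becomes \x{XXXX}
        js_pattern,
    )

print(js_unicode_escapes_to_pcre(r"[\u00A0-\uD7FF\uF900-\uFDCF]"))
```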

If you want to use PHP default regex:
<?php
$pattern = '/^(?!(?:(?:\\x22?\\x5C[\\x00-\\x7E]\\x22?)|(?:\\x22?[^\\x5C\\x22]\\x22?)){255,})(?!(?:(?:\\x22?\\x5C[\\x00-\\x7E]\\x22?)|(?:\\x22?[^\\x5C\\x22]\\x22?)){65,}@)(?:(?:[\\x21\\x23-\\x27\\x2A\\x2B\\x2D\\x2F-\\x39\\x3D\\x3F\\x5E-\\x7E]+)|(?:\\x22(?:[\\x01-\\x08\\x0B\\x0C\\x0E-\\x1F\\x21\\x23-\\x5B\\x5D-\\x7F]|(?:\\x5C[\\x00-\\x7F]))*\\x22))(?:\\.(?:(?:[\\x21\\x23-\\x27\\x2A\\x2B\\x2D\\x2F-\\x39\\x3D\\x3F\\x5E-\\x7E]+)|(?:\\x22(?:[\\x01-\\x08\\x0B\\x0C\\x0E-\\x1F\\x21\\x23-\\x5B\\x5D-\\x7F]|(?:\\x5C[\\x00-\\x7F]))*\\x22)))*@(?:(?:(?!.*[^.]{64,})(?:(?:(?:xn--)?[a-z0-9]+(?:-+[a-z0-9]+)*\\.){1,126}){1,}(?:(?:[a-z][a-z0-9]*)|(?:(?:xn--)[a-z0-9]+))(?:-+[a-z0-9]+)*)|(?:\\[(?:(?:IPv6:(?:(?:[a-f0-9]{1,4}(?::[a-f0-9]{1,4}){7})|(?:(?!(?:.*[a-f0-9][:\\]]){7,})(?:[a-f0-9]{1,4}(?::[a-f0-9]{1,4}){0,5})?::(?:[a-f0-9]{1,4}(?::[a-f0-9]{1,4}){0,5})?)))|(?:(?:IPv6:(?:(?:[a-f0-9]{1,4}(?::[a-f0-9]{1,4}){5}:)|(?:(?!(?:.*[a-f0-9]:){5,})(?:[a-f0-9]{1,4}(?::[a-f0-9]{1,4}){0,3})?::(?:[a-f0-9]{1,4}(?::[a-f0-9]{1,4}){0,3}:)?)))?(?:(?:25[0-5])|(?:2[0-4][0-9])|(?:1[0-9]{2})|(?:[1-9]?[0-9]))(?:\\.(?:(?:25[0-5])|(?:2[0-4][0-9])|(?:1[0-9]{2})|(?:[1-9]?[0-9]))){3}))\\]))$/iD';
$emailaddress = 'xyz.opq2604@email.com';
if (preg_match($pattern, $emailaddress) === 1) {
// emailaddress is valid
}

Skype smileys REGEXP patterns where/how to get?

I would like to understand the patterns they are using for their smileys.

If emoticons are only replaced if surrounded by whitespace or (presumably) at start/end of a line/string, then you can use a series of regexes.
Using this list (taken from http://www.skype-forum.com/ftopic13197.html),...
you can construct these like this:
(?<=^|\s)<<smiley regex>>(?=\s|$)
will match <<smiley regex>> only if it's on its own.
Examples for <<smiley regex>>:
:-?\) :-?\( :-?D 8\)
;\( \(sweat\) :\| :\*
:\$ :\^\) \|-\) \|\(
;\) \]:\) \(talk\) \(yawn\)
\(doh\) :# \(wasntme\) \(party\)
etc. - you'll need to escape a lot of special-meaning characters for use in a regex. Your language might have a re.escape() function for this.
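A minimal Python sketch of this construction, using re.escape() to do the escaping. The SMILEYS list is a small hypothetical subset, not the full Skype list. Python's lookbehind must be fixed-width, so (?<=^|\s) is not accepted there; the sketch swaps in the equivalent (?<!\S) and (?!\S).

```python
import re

# Hypothetical subset of the emoticon list, written as plain text.
SMILEYS = [":)", ":-)", ":(", ":-(", ":D", ";)", "(party)", "(yawn)"]

# re.escape() neutralises characters such as ( ) | * that are special in a regex.
# Longer emoticons are listed first so a shorter one is never tried before them.
alternatives = "|".join(
    re.escape(s) for s in sorted(SMILEYS, key=len, reverse=True)
)

# (?<!\S) means "not preceded by a non-space character", which also holds at
# the start of the string; (?!\S) is the matching check on the right side.
SMILEY_RE = re.compile(rf"(?<!\S)(?:{alternatives})(?!\S)")

def find_smileys(text):
    """Return every emoticon that stands on its own in text."""
    return SMILEY_RE.findall(text)

print(find_smileys("hi :) there (party) x:)y :-("))
```

The ":)" glued between "x" and "y" is skipped because it is not surrounded by whitespace.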

match url that doesn't contain asp, aspx, css, htm, html, jpg

Q-1. match url that doesn't contain asp, aspx, css, htm, html, jpg
Q-2. match url that doesn't end with asp, aspx, css, htm, html, jpg

You want to use the 'match count' quantifier and make it match 0 times.
E.g. (matches all characters, then a dot, then anything that isn't aspx or css):
^.*\.((aspx) | (css)){0}.*$
Edit: added ^ (start of line) and $ (end of line) anchors.

Q-1. This is better done using a normal string search, but if you insist on regex: (.(?!asp|aspx|css|htm|html|jpg))*.
Q-2. This is better done using a normal string search, but if you insist on regex: .*(?<!asp|css|htm|jpg)(?<!aspx|html)$.

If your regular expression implementation supports lookaround assertions, try these:
(?:(?!aspx?|css|html?|jpg).)*
.*$(?<!aspx?|css|html?|jpg)
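A minimal Python sketch of both checks, as one possible reading of the patterns above. Python only accepts fixed-width lookbehinds, so the Q-2 check uses one lookbehind per extension length (3 and 4 characters) rather than the variable-width form. The function names are hypothetical.

```python
import re

# Q-1: no position in the string may start one of the blocked extensions.
NOT_CONTAINING = re.compile(r"(?:(?!aspx?|css|html?|jpg).)*")

# Q-2: the end of the string must not be preceded by a blocked extension.
# All alternatives inside one lookbehind have the same length.
NOT_ENDING = re.compile(r"(?<!asp|css|htm|jpg)(?<!aspx|html)$")

def lacks_extension(url):
    """True if none of the extensions occurs anywhere in url."""
    return NOT_CONTAINING.fullmatch(url) is not None

def does_not_end_with_extension(url):
    """True if url does not end with one of the extensions."""
    return NOT_ENDING.search(url) is not None

for candidate in ["/posts/42/title", "/page.aspx?id=1", "/index.html"]:
    print(candidate, lacks_extension(candidate), does_not_end_with_extension(candidate))
```

The plain string search mentioned above would be any(ext in url for ext in extensions) for Q-1 and url.endswith(tuple(extensions)) for Q-2, with extensions being a list of the blocked strings.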
