Regex for URL matching

I want to match the two URLs below.
1. /,a=e[o],e[o]=function(){s=arguments},i.always(function(){e[o]=a,n[o]&&(n.jsonpcallback=r.jsonpcallback,fn.push(o)),s&&x.isfunction(a)&&a(s[0]),s=a=t}),
2. /,a[f]=function(){h=arguments},e.always(function(){a[f]=g,c[f]&&(c.jsonpcallback=d.jsonpcallback,ce.push(f)),h&&p.isfunction(g)&&g(h[0]),h=g=b}),
The regex I use for that is:
^[a-zA-Z0-9:\/\.,\[\]\=\(\) \{\}\=\&]{0,500}$
But the above regex also matches:
https://www.test.com/test/test/test.php
I want to write a regex in which all the special characters []{}()&,. that appear in the two URLs above are compulsory: if any of these special characters is missing, the regex should not match.

Short Answer
^(?=.*\[)(?=.*])(?=.*\{)(?=.*})(?=.*\()(?=.*\))(?=.*&)(?=.*,)(?=.*\.)[a-zA-Z0-9:\/\.,\[\]\=\(\) \{\}\=\&]{0,500}$
Longer Answer
You can use a positive lookahead to ensure the result contains a set of characters.
For example, add this to the start:
(?=.*\[)
And it will only match results that contain an opening square bracket [
You can do this for each of the special characters that you need to ensure are present.
For example, if you want to ensure it contains all of the characters []{}()&,. then you would add this at the start:
(?=.*\[)(?=.*\])(?=.*\{)(?=.*\})(?=.*\()(?=.*\))(?=.*\&)(?=.*\,)(?=.*\.)
Just be sure to escape the relevant characters as required by your programming language and regex flavor.
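As a minimal sketch of the lookahead approach, assuming Python's built-in re module (the question does not name a language), the lookaheads can be generated from the list of required characters. The names REQUIRED, PATTERN and is_wanted are made up for this illustration; the character class is the one from the question.

```python
import re

# Characters that must each appear at least once (taken from the question).
REQUIRED = "[]{}()&,."

# Build one positive lookahead per required character; re.escape does the escaping.
lookaheads = "".join(f"(?=.*{re.escape(ch)})" for ch in REQUIRED)

# Allowed characters: the character class from the question, up to 500 characters.
PATTERN = re.compile(lookaheads + r"[a-zA-Z0-9:/.,\[\]=() {}&]{0,500}$")


def is_wanted(text):
    """True only if text uses allowed characters and contains every required one."""
    return PATTERN.match(text) is not None


sample = (
    "/,a[f]=function(){h=arguments},e.always(function(){a[f]=g,c[f]&&"
    "(c.jsonpcallback=d.jsonpcallback,ce.push(f)),h&&p.isfunction(g)&&g(h[0]),h=g=b}),"
)
print(is_wanted(sample))
print(is_wanted("https://www.test.com/test/test/test.php"))
```

The second call is rejected because the plain URL contains none of the brackets, braces or parentheses that the lookaheads require.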

Related

Is there a way to use periodicity in a regular expression?

I'm trying to find a regular expression for a Tokenizer operator in Rapidminer.
Now, what I'm trying to do is to split text in parts of, let's say, two words.
For example, That was a good movie. should result in That was, was a, a good, good movie.
What's special about a regex in a tokenizer is that it plays the role of a delimiter, so you match the splitting point and not what you're trying to keep.
Thus the first thought is to use \s in order to split on white spaces, but that would result in getting each word separately.
So, my question is how could I force the expression to somehow skip one in two whitespaces?
First of all, we can use the \W for identifying the characters that separate the words. And for removing multiple consecutive instances of them, we will use:
\W+
Having that in mind, you want to split every 2 instances of characters that are included in the "\W+" expression. Thus, the result must be strings that have the following form:
<a "word"> <separators that are matched by the pattern "\W+"> <another "word">
This means that each token you get from the split you are asking for will have to be further split using the pattern "\W+", in order to obtain the 2 "words" that form it.
For doing the first split you can try this formula:
\w+\W+\w+\K\W+
Then, for each token you have to tokenize it again using:
\W+
For getting tokens of 3 "words", you can use the following pattern for the initial split:
\w+\W+\w+\W+\w+\K\W+
This approach makes use of the \K feature that removes from the match everything that has been captured from the regex up to that point, and starts a new match that will be returned. So essentially, we do: match a word, match separators, match another word, forget everything, match separators and return only those.
In RapidMiner, this can be implemented with 2 consecutive regex tokenizers, the first with the above formula and the second with only the separators to be used within each token (\W+).
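The two-step split can be sketched outside RapidMiner as well. The example below assumes Python's built-in re module, which has no \K, so the "forget everything matched so far" step is emulated with a capture group around the final separators. The function names are hypothetical. Note that this approach produces non-overlapping pairs.

```python
import re

# Same shape as \w+\W+\w+\K\W+, but the separators to split on are captured
# in group 1 because Python's re module does not support \K.
SPLIT_POINT = re.compile(r"\w+\W+\w+(\W+)")


def split_every_two_words(text):
    """First step: cut the text after every second word."""
    tokens, start = [], 0
    for m in SPLIT_POINT.finditer(text):
        tokens.append(text[start:m.start(1)])  # keep everything before the separators
        start = m.end(1)                       # resume after the separators
    tokens.append(text[start:])                # whatever is left over
    return tokens


def split_words(token):
    """Second step: split one token into its words using \\W+."""
    return [w for w in re.split(r"\W+", token) if w]


tokens = split_every_two_words("That was a good movie.")
print(tokens)
print([split_words(t) for t in tokens])
```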
Also note that the pattern \w matches only Latin letters (plus digits and the underscore), so if your documents contain text in a different character set, these characters will be consumed by the \W which is supposed to match the separators. If you want to capture text with non-Latin character sets, like Greek for example, you need to change the formula like this:
\p{L}+\P{L}+\p{L}+\K\P{L}+
Furthermore, if you want the formula to capture text in one language and not in another, you can modify it accordingly by specifying {Language_Identifier} in place of {L}. For example, if you only want to capture text in Greek, you will use "{Greek}", or "{InGreek}", which is what RapidMiner supports.
What you can do is use a zero width group (like a positive look-ahead, as shown in example). Regex usually "consumes" characters it checks, but with a positive lookahead/lookbehind, you assert that characters exist without preventing further checks from checking those letters too.
This should work for your purposes:
(\w+)(?=(\W+\w+))
This pattern matches each pair of consecutive words (note that the last word gets no match of its own, since it has no partner). The first word is in the first capture group, (\w+). Then a positive lookahead matches a sequence of non-word characters, \W+, followed by another run of word characters, \w+. Because this happens inside the lookahead (?=...), the second word is not "consumed" and can start the next match.
Here is a link to a demo on Regex101
Note that for each match, group 1 holds the first word and group 2 holds the separator plus the second word.
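A small sketch of the lookahead idea, assuming Python's re module rather than RapidMiner. It uses a slight variation of the pattern in which only the second word is captured (the separator stays outside group 2); word_pairs is a hypothetical helper name.

```python
import re

# Group 1 is consumed; group 2 is only looked at, so it can start the next match.
PAIR = re.compile(r"(\w+)(?=\W+(\w+))")


def word_pairs(text):
    """Return overlapping two-word tokens."""
    return [f"{m.group(1)} {m.group(2)}" for m in PAIR.finditer(text)]


print(word_pairs("That was a good movie."))
# ['That was', 'was a', 'a good', 'good movie']
```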
Here is an example solution, (?=(\b[A-Za-z]+\s[A-Za-z]+)), inspired by this SO question.
My question was phrased misleadingly: once understood, it turns out to be a problem of overlapping regex matches.

How can I use a regular expression to match words of a certain length but not urls?

For text such as
Save Favorites & Share expressions with friends or the Community.
A full Reference & Help is available in the Library, or watch the video Tutorial.
expressions can start some lines though eventuallys
abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ
http://regexr.com/foo.html?q=bar
https://mediatemple.net
mediatemple.net
I want to select words that are 11 letters long.
I can use
/\b[a-zA-Z]{11}\b/g
(http://regexr.com/3digk)
but it also matches the urls
https://mediatemple.net
mediatemple.net
How can I avoid that? I use \b rather than a space so that words at the start and end of lines also match.
By using a negative lookahead, you can exclude words that have .something directly after them. This excludes any URL without affecting words at the end of a sentence (i.e. where the dot is followed by a space or a newline).
/\b[a-zA-Z]{11}\b(?!\.[^\s]+)/g
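To illustrate, here is a short sketch of the negative lookahead, assuming Python's re module; the sample text is shortened from the question and eleven_letter_words is a made-up name.

```python
import re

# An 11-letter word that is NOT followed by a dot plus a non-space character.
ELEVEN = re.compile(r"\b[a-zA-Z]{11}\b(?!\.[^\s]+)")


def eleven_letter_words(text):
    return ELEVEN.findall(text)


text = (
    "Share expressions with friends.\n"
    "https://mediatemple.net\n"
    "mediatemple.net\n"
    "I like expressions."
)
print(eleven_letter_words(text))
# ['expressions', 'expressions']
```

The word at the very end still matches because its dot is followed by nothing, while both forms of the domain are skipped.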
You can use negative look behind expression to ensure that your match is not preceded by "://".
Use (?<!//), which is a negative lookbehind asserting that the preceding characters are not "//" (inside a /.../ regex literal the slashes have to be escaped):
/(?<!\/\/)\b[a-zA-Z]{11}\b/g
If you want to be more specific and allow double slashes, e.g. "foo//elevenchars", you can use two negative lookbehinds, one for each protocol (in many engines a lookbehind must have a fixed length, so the two protocols cannot share one lookbehind):
/(?<!http:\/\/)(?<!https:\/\/)\b[a-zA-Z]{11}\b/g
This matches foo//elevenchars, but not the URLs that start with a protocol.
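The lookbehind variant can be tried out with the sketch below, again assuming Python's re module, which requires fixed-width lookbehinds; words_not_after_protocol is a hypothetical name. One caveat worth noting: a lookbehind alone does not exclude a bare domain that has no protocol in front of it.

```python
import re

# Two fixed-width negative lookbehinds, one per protocol.
NOT_AFTER_PROTOCOL = re.compile(r"(?<!http://)(?<!https://)\b[a-zA-Z]{11}\b")


def words_not_after_protocol(text):
    return NOT_AFTER_PROTOCOL.findall(text)


print(words_not_after_protocol("https://mediatemple.net"))  # []
print(words_not_after_protocol("foo//elevenchars"))         # ['elevenchars']
print(words_not_after_protocol("mediatemple.net"))          # ['mediatemple']
```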

Regular expression to correct email address

I need help in writing one regular expression where I want to remove unwanted characters in the start and end of the email address. For example:
z>user1#hotmail.com<kt
z>user2#hotmail.pk<kt
z>puser3#yahoo.com<kt
z>npuser4#yaoo.uk<kt
After applying regular expression my emails should look like:
user1#hotmail.com
user2#hotmail.pk
puser3#yahoo.com
npuser4#yaoo.uk
The regular expression should not be applied if the email address is already correct.
You can try deleting matches of
^[^>]*>|<[^>]*$
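A quick sketch of deleting those matches, assuming Python's re module (clean_address is a made-up helper name). An address that contains neither > nor < produces no match and is returned unchanged.

```python
import re

# Leading junk up to and including '>', or trailing junk starting at '<'.
JUNK = re.compile(r"^[^>]*>|<[^>]*$")


def clean_address(raw):
    return JUNK.sub("", raw)


print(clean_address("z>user1#hotmail.com<kt"))  # user1#hotmail.com
print(clean_address("user1#hotmail.com"))       # user1#hotmail.com
```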
Find ^[^>]*>([^<]*)<*.*$ and replace it with \1
Here's an example on regex101
I think you might be missing the point of a regular expression slightly. A regular expression defines the 'shape' of a string and returns whether or not the string conforms to that shape. A simple expression for an email address might be something like:
[a-zA-Z0-9]*\.?[a-zA-Z0-9]+#[a-zA-Z0-9]*\.[a-z]+
But it is not simple to write one catch-all regular expression for an email address. Really, what you need to do to check it properly is:
1. Ensure there is one and only one '#'-sign.
2. Check that the part before the #-sign conforms to a regular expression for this part:
   Characters
   Digits
   Extended characters: .-'_ (that list may not be complete)
3. Check that the part after the #-sign conforms to the regex for domain names:
   Characters
   Digits
   Extended characters: . -
   Must start with a character or digit and must end with a proper domain name ending.
Try using a capturing group on anything between the characters you don't want. For example,
/>(\w+#\w+\.\w+)</
Basically, any part that the regexp inside () matches is saved in a capturing group. This one matches anything inside >here< that starts with one or more word characters (letters, digits or underscore), has a #, has one or more further word characters, then a literal period, then some more word characters. It matches the simple addresses in the question, but not every valid email address (a dot or dash in the name part, for example, is not covered).
If you need characters besides >< to be matched, make a character class. That's what those square bracketed bits are. If you replace > with [.,></?;:'"] it'll match any of those characters.
Demo (Look at the match groups)
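For illustration, the capturing-group approach might look like this in Python (an assumption, since the question names no language). The pattern is the one described above with the dot escaped, and extract_address is a hypothetical helper that falls back to the input when nothing matches.

```python
import re

# Capture the address sitting between '>' and '<'.
WRAPPED = re.compile(r">(\w+#\w+\.\w+)<")


def extract_address(raw):
    m = WRAPPED.search(raw)
    # Leave the input untouched when it is not wrapped in '>' ... '<'.
    return m.group(1) if m else raw


print(extract_address("z>user2#hotmail.pk<kt"))  # user2#hotmail.pk
print(extract_address("user1#hotmail.com"))      # user1#hotmail.com
```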

Limiting RegEx to match only a string of 1-254 characters length

This is my RegEx:
"^[^\.]([\w-\!\#\$\%\&\'\*\+\-\/\=\`\{\|\}\~\?\^]+)([\.]{0,1})([\w-\!\#\$\%\&\'\*\+\-\/\=\`\{\|\}\~\?\^]+)[^\.]#((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([\w-]+\.)+))([a-zA-Z]{2,6}|[0-9]{1,3})(\]?)$"
I need to match only strings less than 255 characters.
I've tried adding a lookahead at the start of the RegEx but it fails:
"^(?=.{1,254})[^\.]([\w-\!\#\$\%\&\'\*\+\-\/\=\`\{\|\}\~\?\^]+)([\.]{0,1})([\w-\!\#\$\%\&\'\*\+\-\/\=\`\{\|\}\~\?\^]+)[^\.]#((\[[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.)|(([\w-]+\.)+))([a-zA-Z]{2,6}|[0-9]{1,3})(\]?)$"
You need the $ in the lookahead to make sure it's only up to 254. Otherwise, the lookahead will match even when there are more than 254.
(?=.{1,254}$)
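The effect of the $ can be checked with a small sketch, assuming Python's re module. Here \w+ is only a stand-in for the real email pattern, and the names are made up for the illustration.

```python
import re

# Without '$' the lookahead only proves that at least one character exists.
UNANCHORED = re.compile(r"^(?=.{1,254})\w+$")
# With '$' the lookahead must reach the end of the string within 254 characters.
ANCHORED = re.compile(r"^(?=.{1,254}$)\w+$")

long_text = "a" * 300

print(UNANCHORED.match(long_text) is not None)  # True: length is not limited
print(ANCHORED.match(long_text) is not None)    # False: longer than 254
print(ANCHORED.match("a" * 254) is not None)    # True
```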
Also, keep in mind that you can greatly simplify your regex because many characters that would usually need to be escaped do not need to when in a character class (square brackets).
"[\w-\!\#\$\%\&\'\*\+\-\/\=\`\{\|\}\~\?\^]"
is the same as this:
"[-\w!#$%&'*+/=`{|}~?^]"
Note that the dash must be first (or last) in the character class to be a literal dash without escaping, and the caret must not be first.
With some other simplifications, here is the complete string:
"^(?=.{1,254}$)[-\w!#$%&'*+/=`{|}~?^]+(\.[-\w!#$%&'*+/=`{|}~?^]+)*#((\d{1,3}\.){3}\d{1,3}|([-\w]+\.)+[a-zA-Z]{2,6})$"
Notes:
I removed the stipulation that the first char shouldn't be a period ([^.]) because the next character class doesn't match a period anyway, so it's redundant.
I removed many extraneous parens
I replaced [0-9] with \d
I replaced {0,1} with the shorthand "?"
After the # sign, it seemed that you were trying to match an IP address or text domain name, so I separated them more so it couldn't be a combination
I'm not sure what the optional square bracket at the end was for, so I removed it: "(]?)"
I tried it in Regex Hero, and it works. See if it works for you.
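As a rough check outside Regex Hero, the simplified pattern can be exercised in Python (an assumption; the question does not say which engine is used). The sample addresses and the name is_valid are invented for this sketch, and '#' is kept as the separator exactly as in the pattern above.

```python
import re

ADDRESS = re.compile(
    r"^(?=.{1,254}$)[-\w!#$%&'*+/=`{|}~?^]+(\.[-\w!#$%&'*+/=`{|}~?^]+)*"
    r"#((\d{1,3}\.){3}\d{1,3}|([-\w]+\.)+[a-zA-Z]{2,6})$"
)


def is_valid(address):
    return ADDRESS.match(address) is not None


print(is_valid("user.name#example.com"))        # True
print(is_valid("user#192.168.0.1"))             # True
print(is_valid(".user#example.com"))            # False: starts with a period
print(is_valid("a" * 250 + "#example.com"))     # False: longer than 254 characters
```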
This depends on what language you are working in. In Python, for example, you can use a regex to split a text into separate strings, and then use len() to discard strings longer than the 254 characters you want to allow.
I think this post will help. It shows how to limit certain patterns but I am not sure how you would add it to the entire regex.

Regular expression to prevent adjacent repeating dashes

In my ASP.NET application I am restricting allowed URL formats with regular expressions. I need to create a regular expression which will not allow adjacent dashes in URLs.
01) allow URLs like
text1-text2.htm
text1-text2-textn.htm
02) prevent URLs like
text1--text2.htm
text1--text2-textn.htm
Try this regex:
/--/
If you found a match then it means the URL had two dashes.
url.Contains("--") will work for you, where the url variable is the url entered. Nice and concise, and you don't have to fuss with a RegEx.
The negative answer posted by Aziz is best, but just for completeness' sake here is a regex that matches the kinds of strings you wish to accept (as opposed to reject):
You want a string made up of zero or more of the following:
a non-dash character, or
a dash followed by a non-dash
A regex for this is
/^(?:[^-]|-(?!-))*$/
Now you can adjust the [^-] part to accept not just any character at all, but only those characters permitted in a URL (that is, if you wish to match all possible urls except those with two consecutive dashes). To do this you will have to find the RFC that gives the URI syntax. Will be somewhat tedious, which is why the negative solution with /--/ combined with other checks is your best bet.
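Both directions can be compared with a short sketch. It assumes Python's re module instead of .NET, and the function names are hypothetical; the file names are the ones from the question.

```python
import re

# Accepts strings made of non-dash characters or dashes not followed by another dash.
NO_DOUBLE_DASH = re.compile(r"^(?:[^-]|-(?!-))*$")


def accepted_by_regex(name):
    return NO_DOUBLE_DASH.match(name) is not None


def accepted_by_search(name):
    # The "negative" approach: reject as soon as two adjacent dashes are found.
    return "--" not in name


for name in ["text1-text2.htm", "text1-text2-textn.htm",
             "text1--text2.htm", "text1--text2-textn.htm"]:
    print(name, accepted_by_regex(name), accepted_by_search(name))
```

Both functions agree on all four names, which is why the simpler substring check is usually sufficient.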
This will match a filename with zero or more occurrences of a single dash followed by some word characters:
^\w+(-\w+)*\.\w+
It should be enough to search for the problem pattern, -{2,} (two or more dashes in a row), and then negate the result: as long as this regex does not match, the URL is valid.
Or use a positive regex that matches only the URLs you do want (note that it also accepts a single dash directly before .htm): ^([A-Za-z0-9]+-?)+\.htm$