Regex pattern to format url - regex

I have this pattern ^(?:http://)?(?:www.)?(.*?)/?(.*?)$ but it's still not perfect.
Let's say we have these urls to test against it:
example.com
example.com/
www.example.com/
http://example.com/
example.com/param
http://example.com/params/
The final output should be example.com/ if there are no parameters and example.com/params/ if there are. My problem is that it only matches the second group. It doesn't look like /? is working; otherwise the first group would stop at the slash character. Is it possible to achieve what I want using only one pattern?

So you want the host name in $1? Your regex is ambiguous: there are many ways to match it. Because (.*?) is lazy, the first group is happy to match nothing, /? then matches nothing as well, and the second group is left to swallow the whole string. If you don't want slashes in the first part, then say so. Explicitly. (?:http://)?(?:www\.)?([^/]*)?/?(.*)?$
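A minimal Python sketch of that suggestion, assuming the goal is the "host/rest" output described in the question. The pattern is a lightly tidied variant of the one above (anchored with ^, redundant ? removed), and the normalize helper is a hypothetical name used only for illustration.

```python
import re

# [^/]* keeps slashes out of the host group, so /? can actually match the separator.
URL_RE = re.compile(r'^(?:http://)?(?:www\.)?([^/]*)/?(.*)$')

def normalize(url):
    """Return 'host/rest', dropping an optional http:// and www. prefix."""
    host, rest = URL_RE.match(url).groups()
    return f"{host}/{rest}"

for url in ("example.com", "www.example.com/", "http://example.com/params/"):
    print(url, "->", normalize(url))
```

Every part of the pattern is optional, so match() always succeeds here; real input would still need validation of the host itself.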

One that I've used is:
((?:(?:https?://)?[\w\d:#@%/;$()~_?\+\-=&]+|www|ftp)\.[\w\d:#@%/;$()~_?\+\-=&\.]+)
The problem with URLs is that there are SO many ways one can be written, which is why the above code looks so congested. This will match all your examples above, but it will also match things like:
alkasi.jaias
Hopefully this will get you headed to where you need or want to go, and perhaps someone might be able to come up behind me and clean it up some (it's early morning, I'm getting ready for work, and am exhausted. :P)

Related

Find last occurrence of period with regex

I'm trying to create a regex for validating URLs. I know there are many advanced ones out there, but I want to create my own for learning purposes.
So far I have a regex that works quite well, however I want to improve the validation for the TLD part of the URI because I feel it's not quite there yet.
Here's my regex (or find it on regexr):
/^[(http(s)?):\/\/(www\.)?a-zA-Z0-9@:._\+~#=]{2,256}\.[a-zA-Z]{2,6}\b([/#?]{0,1}([A-Za-z0-9-._~:?#[\]@!$&'()*+,;=]|(%[A-Fa-f0-9]{2}))*)$/
It works well for links such as foo.com or http://foo.com or foo.co.uk
The problem appears when you introduce subdomains or second-level domains such as co.uk because the regex will accept foo.co.u or foo.co..
I did try using the following to select the substring after the last .:
/[(http(s)?):\/\/(www\.)?a-zA-Z0-9@:._\+~#=]{2,256}[^.]{2,}$/
but this prevents me from defining the path rules of the URI.
How can I ensure that the substring after the last . but before the first /, ? or # is at least 2 characters long?
From what I can see, you're almost there. I made some modifications and it seems to work.
^(http(s)?:\/\/)?(www\.)?[a-zA-Z0-9@:._\+~#=]{2,256}\.[a-zA-Z]{2,6}([/#?;]([A-Za-z0-9-._~:?#[\]@!$&'()*+,;=]|(%[A-Fa-f0-9]{2}))*)?$
It can be shortened somewhat by using \w:
^(http(s)?:\/\/)?(www\.)?[\w@:.\+~#=]{2,256}\.[a-zA-Z]{2,6}([/#?;]([-\w.~:?#[\]@!$&'()*+,;=]|(%[A-Fa-f0-9]{2}))*)?$
(basically just tweaked your regex)
The main difference is that the parameter part is optional, but if it is there it has to start with one of /#?;. That part could probably be simplified as well.
Edit:
After some experimenting I think this one is about as simple as it'll get:
^(http(?:s)?:\/\/)?([-.~\w]+\.[a-zA-Z]{2,6})(:\d+)?(\/[-.~\w]*)?([#/#?;].*)?$
It also captures the separate parts - scheme, host, port, path and query/params.
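As a rough check of that last pattern, here is a small Python sketch that splits a URL into the five captured parts. The split_url helper is a hypothetical name, and the pattern is copied as written above; note that Python's \w also accepts non-ASCII word characters, which may or may not be what you want.

```python
import re

URL_RE = re.compile(
    r'^(http(?:s)?:\/\/)?([-.~\w]+\.[a-zA-Z]{2,6})(:\d+)?(\/[-.~\w]*)?([#/#?;].*)?$'
)

def split_url(url):
    """Return (scheme, host, port, path, rest), or None if the URL is rejected."""
    m = URL_RE.match(url)
    return m.groups() if m else None

print(split_url("http://foo.co.uk:8080/path?x=1"))
print(split_url("foo.co.u"))  # last label is shorter than 2 letters
```

Groups that do not take part in the match come back as None, so a bare foo.com yields only the host.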

Regex, optional match in url

I spent a couple of hours on this with no good result (maybe my mood is not helping).
I am trying to build a regex to help me match both urls:
/reservables/imagenes/4/editar/6
/reservables/imagenes/4/subir
As you can see above, the last segment of the first url (6) is not present at the end of the second url, because this segment is optional. So I need to match both urls with one regex. For that, I have tried this:
reservables/(editar|imagenes)/([0-9]+)/(imagen|editar|actualizar|subir)/([0-9]+)
That works fine, but only for the first url. From reading a few notes about regex, it seems that I need the ? symbol, right? So I tried this one, but it did not work:
reservables/(editar|imagenes)/([0-9]+)/(imagen|editar|actualizar|subir)/([0-9]+)?
Well, I do not know what I am doing wrong.
You want the ? to apply to the / as well, by wrapping both in an optional non-capturing group, like so:
reservables/(editar|imagenes)/([0-9]+)/(imagen|editar|actualizar|subir)(?:/([0-9]+))?
You can see that it matches correctly on debuggex.
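A short Python sketch of the optional-group version, for anyone who wants to try it locally. The parse_route helper is a hypothetical name, and the trailing $ is an addition (an assumption that nothing should follow the optional id).

```python
import re

ROUTE_RE = re.compile(
    r'reservables/(editar|imagenes)/([0-9]+)/'
    r'(imagen|editar|actualizar|subir)(?:/([0-9]+))?$'  # "/<id>" is optional as a unit
)

def parse_route(path):
    """Return the four captured segments; the last is None when the id is absent."""
    m = ROUTE_RE.search(path)
    return m.groups() if m else None

print(parse_route("/reservables/imagenes/4/editar/6"))
print(parse_route("/reservables/imagenes/4/subir"))
```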
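This one also works, but only when the URL keeps its trailing slash (e.g. /reservables/imagenes/4/subir/), because the final / is still required: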
reservables/(editar|imagenes)/([0-9]+)/(imagen|editar|actualizar|subir)/([0-9]*)

Using RegEx to match domain.com and www.domain.com but NO OTHER subdomains?

Sorry if this has been asked elsewhere, I've been looking and can't find it for the life of me. I am attempting at tackling regular expressions, I've ALWAYS had problems with the more advanced scenarios... well, others find them quite easy, so maybe there's something wrong with me.
Anyway, I am attempting to write a RegEx that matches www.domain.com OR domain.com but NO OTHER SUBDOMAINS or anything. The only two strings I want to pass the regex are "domain.com" and "www.domain.com" and I haven't been able to find exactly what I am looking for other than including all subdomain matching, which I find easy.
The closest I have come is this: "regex for matching something if it is not preceded by something else", but in that case it's failing for only one preceding string, whereas I want it to succeed for only one preceding string/subdomain. Note, "domain.com" will always be static, meaning it will always be that exact string "domain.com", not various domains.
Thanks so much for shedding light on this!
Tyler
Just put the www. part in a non-capturing group and make the group optional. Remember to escape the dots, otherwise they match any character.
/^(?:www\.)?example\.com$/
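The same idea as a Python sketch, using the question's domain.com as the fixed host. The is_allowed_host helper is a hypothetical name; fullmatch is used here in place of the ^ and $ anchors.

```python
import re

# Dots are escaped so that "domainxcom" is not accepted by accident.
DOMAIN_RE = re.compile(r'(?:www\.)?domain\.com')

def is_allowed_host(host):
    """True only for 'domain.com' and 'www.domain.com'."""
    return DOMAIN_RE.fullmatch(host) is not None

for host in ("domain.com", "www.domain.com", "mail.domain.com"):
    print(host, is_allowed_host(host))
```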

Regex Problem (newbie)

i'm writing a little app for spam-checking and i'm having problems with a regex.
let's say i'm having this spam-url:
http://hosting.tyumen.ru/tip.html
so i want to check its url for having 2 full stops (subdomain+ending), a slash, a word, full stop and "html".
here's what i got so far:
(http://.*?\..*?..*?/.*?.html)
might look like rubbish but it works - the problem: it's really slow and freezing my app.
any hints on how to optimize it?
thx.
The reason it's slow is that the non-greedy quantifier .*?, used repeatedly like this, is prone to catastrophic backtracking.
Instead of saying "any amount of anything, but only to an extent where it doesn't conflict with later requirements", which is effectively what .*? is saying, try asking for "as much as possible, that isn't a double quote, which would terminate the href ":
\1
I also added a back-reference (\1) to your first capturing group, inside the <a>...</a>, so that you don't have to do the exact same matching all over again.
Note that this regex will be broken if, say, the a has a class name, an id, or anything else in its body. I left it like this because I wanted to give you what you asked for with as few changes as possible, and as to-the-point as possible.
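A minimal Python sketch of the two ideas described here: a negated character class that cannot run past the closing quote, and a back-reference to the captured URL. The exact pattern and the find_self_linked_urls helper are assumptions for illustration, and they expect links of the plain form <a href="URL">URL</a>.

```python
import re

# Hypothetical pattern: [^"]+ stops at the quote that ends the href,
# and \1 requires the link text to repeat the captured URL exactly.
LINK_RE = re.compile(r'<a href="(http://[^"]+\.html)">\1</a>')

def find_self_linked_urls(html):
    """Return URLs whose link text is identical to their href."""
    return LINK_RE.findall(html)

sample = '<a href="http://hosting.tyumen.ru/tip.html">http://hosting.tyumen.ru/tip.html</a>'
print(find_self_linked_urls(sample))
```

As noted above, any extra attribute on the a tag would stop this from matching.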
(http://[\w.-]+/.+?\.html) may work, though perhaps only for your case.
Or maybe a faster one:
(http://[\w.-]+/[^.]+\.html)
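A quick way to try the second pattern from Python, as a sketch only. The find_html_urls helper is a hypothetical name. Because [^.]+ cannot cross a dot, the engine has far fewer ways to split the string than with chained .*? pieces.

```python
import re

URL_RE = re.compile(r'http://[\w.-]+/[^.]+\.html')

def find_html_urls(text):
    """Return every http:// URL of the form host/name.html found in text."""
    return URL_RE.findall(text)

print(find_html_urls("spam: http://hosting.tyumen.ru/tip.html buy now"))
```

Note that this pattern does not enforce the two dots in the host that the question asks for; it only bounds the path segment.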
Since you claim to be a regexp newbie, I will offer some more general advice on creating and debugging regular expressions. When they get pretty complicated, I find using Regexp Coach a must.
It's freeware and really saves a lot of headaches. Not to mention you don't have to build / run your application every minute just to see if the regexp works the way you wanted.
In Python, a simple way to match URLs ending in .html or .htm is to use
import re

url_re = re.compile(
    r'https?://'  # http:// or https://
    r'(?:(?:[A-Z0-9](?:[A-Z0-9-]{0,61}[A-Z0-9])?\.)+[A-Z]{2,6}\.?|'  # domain...
    r'localhost|'  # localhost...
    r'\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})'  # ...or ip
    r'(?::\d+)?'  # optional port
    r'\S+\.html?$',  # rest of the URL, ending in .html or .htm
    re.IGNORECASE)
which is a modified version of Django's UrlField regex.
This will match any site ending with .html or .htm. (either localhost, ip, domain).
#http://[-a-zA-Z0-9]+\.[-a-zA-Z0-9]+\.[-a-zA-Z]+/\w+\.html#

Regular Expression to match multiple query string parameter/value pairs

About to work through this one, but thought someone may have already had to tackle it, so...
I'm looking for an elegant (and isapi rewrite compatible) regular expression to look for three known parameter/value pairs in a querystring, regardless of order, and also extract all other parameters while stripping out those three.
abc=123, def=456 and ghi=789 are all known, fixed strings. They may appear in any order in the querystring, may or may not be the only parameters, and may or may not be adjacent. It should be smart and not match aaabc=123 or abc=1234 (so each searched parameter should be bracketed by &, ?, #, or end of string). The output I want is a new query string with those three parameters stripped out and the remaining params preserved.
I'll probably be taking a stab at the logic in the morning, so bonus points if you can solve it before then.
I think regexes shouldn't be used for problems of this type. Just tokenize the string, and compare every parameter's name to what you are looking for.
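A minimal sketch of the tokenizing approach in Python, using the standard library's urllib.parse. The strip_known helper is a hypothetical name, and requiring all three pairs to be present (returning None otherwise) is an assumption based on the question. Note that parse_qsl decodes percent-escapes and urlencode re-encodes them, so the output may not be byte-for-byte identical to the input.

```python
from urllib.parse import parse_qsl, urlencode

KNOWN = {("abc", "123"), ("def", "456"), ("ghi", "789")}

def strip_known(query):
    """Drop the three known pairs; return None unless all three are present."""
    pairs = parse_qsl(query, keep_blank_values=True)
    if set(pairs) & KNOWN != KNOWN:
        return None
    # Comparing whole (name, value) tuples means aaabc=123 and abc=1234 never match.
    return urlencode([p for p in pairs if p not in KNOWN])

print(strip_known("x=1&def=456&abc=123&y=2&ghi=789"))
```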
s/(\?|\#|\&)(abc=123|def=456|ghi=789)(\&|\#|$)//g
This is approximate and untested, but presents a working (I think) concept. Basically, look for starting border, literal string, then end border, replacing each with null, globally, and using | to give alternate options for each.
Here's what I've come up with:
RewriteRule ^/oldpage.htm\?(.*)(?<=\?|&)(?:abc=123&|def=456&|ghi=789&)(.*)(?<=&)(?:abc=123&|def=456&|ghi=789&)(.*)(?<=&)(?:(?:abc=123|def=456|ghi=789)(?:&|#|$))(.*) /newpage.htm?$1$2$3 [I,RP,L]
which I think works. The lookahead/lookbehind qualifiers, (?= and (?<=, seem to be the key to letting me look for the enclosing & or ? without "consuming" it and messing up the next match.
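To see the lookbehind idea in isolation, here is a sketch using Python's re module instead of ISAPI Rewrite. The pattern and the strip_pairs helper are assumptions for illustration; unlike the rule above, this version removes whichever of the three pairs are present and does not require all three.

```python
import re

# (?<=[?&]) checks the preceding delimiter without consuming it, so a pair
# that directly follows a removed pair can still be matched.
KNOWN_PAIR = re.compile(r'(?<=[?&])(?:abc=123|def=456|ghi=789)(?:&|(?=#)|$)')

def strip_pairs(url):
    """Remove the known pairs, then tidy a dangling ? or & before # or the end."""
    stripped = KNOWN_PAIR.sub('', url)
    return re.sub(r'[?&]+(?=#|$)', '', stripped)

print(strip_pairs("/oldpage.htm?x=1&abc=123&def=456&y=2&ghi=789"))
```

The second substitution is what avoids a trailing "?" when only the three known parameters were present.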
One gotcha is that if the old page url only has the three params, I still end up with a trailing ? with no parameters on the redirected url, "/newpage.htm?". I'm currently planning to avoid that by using a RewriteCond to only look at urls with 4+ params before this fires, and have a simpler match regex for the ones with exactly three, so the full ruleset comes out to:
RewriteCond URL ^/oldpage.htm\?([^#]*=[^#]*&){3,}[^#]*=[^#]*.*
RewriteRule ^/oldpage.htm\?(.*)(?<=\?|&)(?:abc=123&|def=456&|ghi=789&)(.*)(?<=&)(?:abc=123&|def=456&|ghi=789&)(.*)(?<=&)(?:(?:abc=123|def=456|ghi=789)(?:&|#|$))(.*) /newpage.htm?$1$2$3 [I,RP,L]
RewriteRule ^/oldpage.htm\?(?:abc=123|def=456|ghi=789)&(?:abc=123|def=456|ghi=789)&(?:abc=123|def=456|ghi=789)(.*) /newpage.htm$1 [I,RP,L]
(the $1 at the end is for #additions to the url...do I really need it?) The other issue is I suppose a url of /oldpage.htm?abc=123&abc=123&abc=123 would trigger this, but I don't see any easy way around that, and am not too worried about it..
Can anyone think of a better way to approach this, or see any other issues?
There are query string decoders for this, and many related questions on this site. Apache also provides a decoder; see its Javadocs.