Regular Expression to match multiple query string parameter/value pairs - regex

About to work through this one, but thought someone may have already had to tackle it, so...
I'm looking for an elegant (and isapi rewrite compatible) regular expression to look for three known parameter/value pairs in a querystring, regardless of order, and also extract all other parameters while stripping out those three.
abc=123, def=456 and ghi=789 are all known, fixed strings. They may appear in any order in the query string, may or may not be the only parameters, and may or may not be adjacent. The match should be smart enough not to hit aaabc=123 or abc=1234 (so each searched parameter should be bracketed by &, ?, #, or end of string). The output I want is a new query string with those three pairs stripped out and the remaining params kept.
I'll probably be taking a stab at the logic in the morning, so bonus points if you can solve it before I try to then.

I think regexes shouldn't be used for problems of this type. Just tokenize the string, and compare every parameter's name to what you are looking for.
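A minimal sketch of the tokenizing approach, in Python rather than as an ISAPI rewrite rule. The function name strip_known_pairs and the sample URLs are made up for illustration; only the three pairs come from the question.

```python
from urllib.parse import urlsplit, urlunsplit

# The three known, fixed pairs from the question.
KNOWN_PAIRS = {"abc=123", "def=456", "ghi=789"}

def strip_known_pairs(url):
    """Return (new_url, found_all): the URL without the known pairs,
    and whether all three were present (in any order)."""
    parts = urlsplit(url)
    # Tokenize on '&'; comparing whole tokens means aaabc=123 and
    # abc=1234 can never match by accident.
    tokens = parts.query.split("&") if parts.query else []
    kept = [t for t in tokens if t not in KNOWN_PAIRS]
    found_all = KNOWN_PAIRS.issubset(tokens)
    # urlunsplit omits the '?' when the remaining query is empty.
    new_url = urlunsplit(parts._replace(query="&".join(kept)))
    return new_url, found_all

print(strip_known_pairs("/oldpage.htm?x=1&abc=123&y=2&ghi=789&def=456"))
```

Because the comparison is on whole tokens, the trailing "?" problem and the partial-match problem both go away without any lookarounds.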

s/(\?|\#|\&)(abc=123|def=456|ghi=789)(\&|\#|$)//g
This is approximate and untested, but presents a working (I think) concept. Basically, look for starting border, literal string, then end border, replacing each with null, globally, and using | to give alternate options for each.
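Here is a variant of that concept as a Python sketch (not an ISAPI rule). It is an assumption-laden illustration: the leading border is checked with a lookbehind so it is not consumed, one trailing & is swallowed with each pair, and a second pass tidies up a dangling ? or &. The name strip_known is hypothetical.

```python
import re

KNOWN = r"(?:abc=123|def=456|ghi=789)"

# A known pair must start right after '?' or '&' (lookbehind, not consumed)
# and end at '&' (consumed) or just before '#' or end of string (lookahead).
PAIR = re.compile(r"(?<=[?&])" + KNOWN + r"(?:&|(?=#|$))")

# A '?' or '&' left dangling in front of '#' or the end of the string.
DANGLING = re.compile(r"[?&](?=#|$)")

def strip_known(url):
    stripped = PAIR.sub("", url)
    return DANGLING.sub("", stripped)

print(strip_known("/oldpage.htm?abc=123&x=1&def=456&ghi=789"))
```

Because the leading border is only looked at, two adjacent known pairs can both match even though they share a single & between them.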

Here's what I've come up with:
RewriteRule ^/oldpage.htm\?(.*)(?<=\?|&)(?:abc=123&|def=456&|ghi=789&)(.*)(?<=&)(?:abc=123&|def=456&|ghi=789&)(.*)(?<=&)(?:(?:abc=123|def=456|ghi=789)(?:&|#|$))(.*) /newpage.htm?$1$2$3 [I,RP,L]
I think this works. The lookahead/lookbehind qualifiers, (?<= and (?=, seem to be the key to letting me look for the surrounding & or ? without "consuming" it and messing up the next match.
One gotcha is that if the old page URL has only the three params, I still end up with a trailing ? with no parameters on the redirected URL, "/newpage.htm?". I'm currently planning to avoid that by using a RewriteCond so this rule only fires for URLs with 4+ params, and to have a simpler match regex for the ones with exactly three. So the full ruleset comes out to:
RewriteCond URL ^/oldpage.htm\?([^#]*=[^#]*&){3,}[^#]*=[^#]*.*
RewriteRule ^/oldpage.htm\?(.*)(?<=\?|&)(?:abc=123&|def=456&|ghi=789&)(.*)(?<=&)(?:abc=123&|def=456&|ghi=789&)(.*)(?<=&)(?:(?:abc=123|def=456|ghi=789)(?:&|#|$))(.*) /newpage.htm?$1$2$3 [I,RP,L]
RewriteRule ^/oldpage.htm\?(?:abc=123|def=456|ghi=789)&(?:abc=123|def=456|ghi=789)&(?:abc=123|def=456|ghi=789)(.*) /newpage.htm$1 [I,RP,L]
(the $1 at the end is for #additions to the url...do I really need it?) The other issue is I suppose a url of /oldpage.htm?abc=123&abc=123&abc=123 would trigger this, but I don't see any easy way around that, and am not too worried about it..
Can anyone think of a better way to approach this, or see any other issues?

There are querystring decoders. There are many connected topics, especially on this site.
Some of them.
First
Second
And javadocs link for apache decoder.

Related

Find last occurrence of period with regex

I'm trying to create a regex for validating URLs. I know there are many advanced ones out there, but I want to create my own for learning purposes.
So far I have a regex that works quite well, however I want to improve the validation for the TLD part of the URI because I feel it's not quite there yet.
Here's my regex (or find it on regexr):
/^[(http(s)?):\/\/(www\.)?a-zA-Z0-9#:._\+~#=]{2,256}\.[a-zA-Z]{2,6}\b([/#?]{0,1}([A-Za-z0-9-._~:?#[\]#!$&''()*+,;=]|(%[A-Fa-f0-9]{2}))*)$/
It works well for links such as foo.com or http://foo.com or foo.co.uk
The problem appears when you introduce subdomains or second-level domains such as co.uk because the regex will accept foo.co.u or foo.co..
I did try using the following to select the substring after the last .:
/[(http(s)?):\/\/(www\.)?a-zA-Z0-9#:._\+~#=]{2,256}[^.]{2,}$/
but this prevents me from defining the path rules of the URI.
How can I ensure that the substring after the last . but before the first /, ? or # is at least 2 characters long?
From what I can see, you're almost there. I made some modifications and it seems to work.
^(http(s)?:\/\/)?(www\.)?[a-zA-Z0-9#:._\+~#=]{2,256}\.[a-zA-Z]{2,6}([/#?;]([A-Za-z0-9-._~:?#[\]#!$&''()*+,;=]|(%[A-Fa-f0-9]{2}))*)?$
Can be somewhat shortened by doing
^(http(s)?:\/\/)?(www\.)?[\w#:.\+~#=]{2,256}\.[a-zA-Z]{2,6}([/#?;]([-\w.~:?#[\]#!$&''()*+,;=]|(%[A-Fa-f0-9]{2}))*)?$
(basically just tweaked your regex)
The main difference is that the parameter part is optional, but if it is there it has to start with one of /#?;. That part could probably be simplified as well.
Check it out here.
Edit:
After some experimenting I think this one is about as simple as it'll get:
^(http(?:s)?:\/\/)?([-.~\w]+\.[a-zA-Z]{2,6})(:\d+)?(\/[-.~\w]*)?([#/#?;].*)?$
It also captures the separate parts - scheme, host, port, path and query/params.
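To see the captured parts, here is a small Python sketch that applies the last regex as written above. The helper name split_url and the dictionary keys are my own labels for the five groups, not anything from the original answer.

```python
import re

# The simplified regex from the edit above, unchanged.
URL_RE = re.compile(
    r"^(http(?:s)?:\/\/)?([-.~\w]+\.[a-zA-Z]{2,6})(:\d+)?(\/[-.~\w]*)?([#/#?;].*)?$"
)

def split_url(url):
    """Return the five captured parts as a dict, or None if the URL is rejected."""
    m = URL_RE.match(url)
    if m is None:
        return None
    scheme, host, port, path, rest = m.groups()
    return {"scheme": scheme, "host": host, "port": port, "path": path, "rest": rest}

print(split_url("http://foo.co.uk:8080/path?x=1"))
print(split_url("foo.co.u"))   # last label shorter than 2 letters
print(split_url("foo.co."))    # nothing after the last period
```

The two short-TLD inputs from the question are rejected because the host group must end in a period followed by 2 to 6 letters, and whatever follows has to start with a port, a slash, or one of #/?;.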
Example here.

Regex for URL routing - match alphanumeric and dashes except words in this list

I'm using CodeIgniter to write an app where a user will be allowed to register an account and is assigned a URL (URL slug) of their choosing (ex. domain.com/user-name). CodeIgniter has a URL routing feature that allows the utilization of regular expressions (link).
Users are only allowed to register URLs that contain alphanumeric characters, dashes (-), and underscores (_). This is the regex I'm using to verify the validity of the URL slug: ^[A-Za-z0-9][A-Za-z0-9_-]{2,254}$
I am using the URL routing feature to route a few URLs to features on my site (ex. /home -> /pages/index, /activity -> /user/activity), so those particular URLs obviously cannot be registered by a user.
I'm largely inexperienced with regular expressions but have attempted to write an expression that would match any URL slugs with alphanumerics/dash/underscore except if they are any of the following:
default_controller
404_override
home
activity
Here is the code I'm using to try to match the words with that specific criteria:
$route['(?!default_controller|404_override|home|activity)[A-Za-z0-9][A-Za-z0-9_-]{2,254}'] = 'view/slug/$1';
but it isn't routing properly. Can someone help? (side question: is it necessary to have ^ or $ in the regex when trying to match with URL's?)
Alright, let's pick this apart.
Ignore CodeIgniter's reserved routes.
The default_controller and 404_override portions of your route are unnecessary. Routes are compared to the requested URI to see if there's a match. It is highly unlikely that those two items will ever be in your URI, since they are special reserved routes for CodeIgniter. So let's forget about them.
$route['(?!home|activity)[A-Za-z0-9][A-Za-z0-9_-]{2,254}'] = 'view/slug/$1';
Capture everything!
With regular expressions, a group is created using parentheses (). This group can then be retrieved with a back reference - in our case, the $1, $2, etc. located in the second part of the route. You only had a group around the first set of items you were trying to exclude, so it would not properly capture the entire wild card. You found this out yourself already, and added a group around the entire item (good!).
$route['((?!home|activity)[A-Za-z0-9][A-Za-z0-9_-]{2,254})'] = 'view/slug/$1';
Look-ahead?!
On that subject, the first group around home|activity is not actually a traditional group, due to the use of ?! at the beginning. This is called a negative look-ahead, and it's a complicated regular expression feature. And it's being used incorrectly:
Negative lookahead is indispensable if you want to match something not followed by something else.
There's a LOT more I could go into with this, but basically we don't really want or need it in the first place, so I'll let you explore if you'd like.
In order to make your life easier, I'd suggest separating the home, activity, and other existing controllers in the routes. CodeIgniter will look through the list of routes from top to bottom, and once something matches, it stops checking. So if you specify your existing controllers before the wild card, they will match, and your wild card regular expression can be greatly simplified.
$route['home'] = 'pages';
$route['activity'] = 'user/activity';
$route['([A-Za-z0-9][A-Za-z0-9_-]{2,254})'] = 'view/slug/$1';
Remember to list your routes in order from most specific to least. Wild card matches are less specific than exact matches (like home and activity), so they should come after (below).
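To illustrate the top-to-bottom, first-match-wins idea outside of CodeIgniter, here is a rough Python sketch. It is not CodeIgniter's actual router: the ROUTES list and resolve function are hypothetical, \1 stands in for $1, and re.fullmatch is used on the assumption that the whole URI has to match.

```python
import re

# Ordered from most specific to least specific, as described above.
ROUTES = [
    ("home", "pages"),
    ("activity", "user/activity"),
    (r"([A-Za-z0-9][A-Za-z0-9_-]{2,254})", r"view/slug/\1"),
]

def resolve(uri):
    """Return the target of the first route whose pattern matches the whole URI."""
    for pattern, target in ROUTES:
        m = re.fullmatch(pattern, uri)
        if m:
            # expand() substitutes \1 with the captured slug.
            return m.expand(target)
    return None

print(resolve("home"))       # exact route wins before the wild card
print(resolve("john-doe"))   # falls through to the wild card
```

Since "home" and "activity" are matched before the wild card is ever tried, the wild card needs no look-ahead to exclude them.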
Now, that's all the complicated stuff. A little more FYI.
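Remember that a dash - has a special meaning between [] brackets: it defines a range. At the very end of the class, as in your pattern, it is already taken literally, but escaping it makes the intent explicit when you want to match a literal dash.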
$route['([A-Za-z0-9][A-Za-z0-9_\-]{2,254})'] = 'view/slug/$1';
Note that your character repetition min/max {2,254} only applies to the second set of characters, so your user names must be 3 characters at minimum, and 255 at maximum. Just an FYI if you didn't realize that already.
I saw your own answer to this problem, and it's just ugly. Sorry. The ^ and $ symbols are used improperly throughout the lookahead (which still shouldn't be there in the first place). It may "work" for a few use cases that you're testing it with, but it will just give you problems and headaches in the future.
Hopefully now you know more about regular expressions and how they're matched in the routing process.
And to answer your question, no, you should not use ^ and $ at the beginning and end of your regex -- CodeIgniter will add that for you.
Use the 404, Luke...
At this point your routes are improved and should be functional. I will throw it out there, though, that you might want to consider using the controller/method defined as the 404_override to handle your wild cards. The main benefit of this is that you don't need ANY routes to direct a wild card, or to prevent your wild card from goofing up existing controllers. You only need:
$route['404_override'] = 'view/slug';
Then, your View::slug() method would check the URI, and see if it's a valid pattern, then check if it exists as a user (same as your slug method does now, no doubt). If it does, then you're good to go. If it doesn't, then you throw a 404 error.
It may not seem that graceful, but it works great. Give it a shot if it sounds better for you.
I'm not familiar with CodeIgniter specifically, but most frameworks' routing operates based on precedence. In other words, the default controller, 404, etc. routes should be defined first. Then you can simplify your regex to only match the slugs.
OK, answering my own question.
I seem to have come up with a different expression that works:
$route['(^(?!default_controller$|404_override$|home$|activity$)[A-Za-z0-9][A-Za-z0-9_-]{2,254}$)'] = 'view/slug/$1';
I added parentheses around the whole expression (I think that's what CodeIgniter matches with $1 on the right) and added a start-of-line anchor, ^, and a bunch of end-of-line anchors, $.
Hope this helps someone who may run into this problem later.
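For anyone wondering what the $ inside the look-ahead changes, here is a small Python sketch. It only approximates the route: re.fullmatch stands in for the anchoring, and the names PREFIX_BLOCK, EXACT_BLOCK and allowed are made up for the example.

```python
import re

SLUG = r"[A-Za-z0-9][A-Za-z0-9_-]{2,254}"

# Without '$' the look-ahead rejects anything that merely starts with a reserved word.
PREFIX_BLOCK = re.compile(r"(?!home|activity)" + SLUG)
# With '$' only the exact reserved words are rejected.
EXACT_BLOCK = re.compile(r"(?!(?:home|activity)$)" + SLUG)

def allowed(pattern, slug):
    """True if the whole slug matches the given pattern."""
    return pattern.fullmatch(slug) is not None

for slug in ("home", "homestead", "john-doe"):
    print(slug, allowed(PREFIX_BLOCK, slug), allowed(EXACT_BLOCK, slug))
```

So a user name such as homestead is only accepted when the reserved words inside the look-ahead are followed by $.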

Regex pattern to format url

I have this pattern ^(?:http://)?(?:www.)?(.*?)/?(.*?)$ but it's still not perfect.
Let's say we have these urls to test against it:
example.com
example.com/
www.example.com/
http://example.com/
example.com/param
http://example.com/params/
The final output should be example.com/ if there's no parameters and example.com/params/ if with parameters. My problem is that it matches only second group. It doesn't look like /? is working otherwise it would stop on slash character. Is it possible to achieve what I want using only one pattern?
So you want the host name in $1? Your regex is ambiguous: there are many ways to match it, and a backtracking engine takes the first one that succeeds. With the lazy (.*?) that means the first group matches as little as possible (nothing) and the second group takes everything. If you don't want slashes in the first part, then say so. Explicitly. (?:http://)?(?:www\.)?([^/]*)?/?(.*)?$
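A quick way to check the explicit pattern against the test URLs from the question, sketched in Python. The normalize helper is hypothetical, and re.match is used so the pattern is anchored at the start of the string.

```python
import re

# Host part is "anything but a slash"; the rest goes into group 2.
URL_RE = re.compile(r"(?:http://)?(?:www\.)?([^/]*)?/?(.*)?$")

def normalize(url):
    """Return host + '/' + whatever followed the first slash."""
    m = URL_RE.match(url)
    host = m.group(1) or ""
    rest = m.group(2) or ""
    return f"{host}/{rest}"

for url in ("example.com", "example.com/", "www.example.com/",
            "http://example.com/", "example.com/param",
            "http://example.com/params/"):
    print(url, "->", normalize(url))
```

Every part of the pattern is optional, so the match always succeeds; the work is done by [^/]*, which cannot run past the first slash.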
One that I've used is:
((?:(?:https?://)?[\w\d:##%/;$()~_?\+\-=&]+|www|ftp)\.[\w\d:##%/;$()~_?\+\-=&\.]+)
The problem with URLs is that there are SO many ways one can be written, which is why the above code looks so congested. This will match all your examples above, but it will also match things like:
alkasi.jaias
Hopefully this will get you headed to where you need or want to go, and perhaps someone might be able to come up behind me and clean it up some (it's early morning, I'm getting ready for work, and am exhausted. :P)

Can one use named backreference's in Apache mod_rewrite

All,
I've come across an interesting little quirk in one of my RewriteRules, which I wanted to resolve by the use of named back references. However from what I can see, this is not possible in Apache's mod_rewrite.
I have two incoming urls, each containing a key variable, which need to be rewritten to the same underlying framework action.
Incoming urls:
/users/list/page-2
/users/list/2
Desired rewrite endpoint
/?module=users&action=list&pagenum=2
I would have liked to do something like this
RewriteRule ^/(?P<module>([\w]+))/(?P<action>([\w]+))/(page-)?(?P<pagenum>([\d]+))$ /?module=${module}&action=${action}&pagenum=${pagenum} [L,QSA]
However Apache just doesn't want to play like that at all, and gives me null values in the places of the named backreferences. To get me round the problem I've used numerical references to the captured groups ($1, $2, $4)(but I'm almost halfway to the N=9 apache limit). So this isn't a show stopper for me.
I would just like to know, if named backreferences are available in Apache's mod_rewrite, and if they are, why does my RewriteRule's pattern not match?
Thanks,
Ian
This might be useful:
https://httpd.apache.org/docs/trunk/rewrite/rewritemap.html
If @superspace's latest answer doesn't work, I would suggest routing every request that isn't for an existing file or directory to an index page. Then set up a routing class that takes the page name and does the matching manually, so you can keep an array of named-capture regexes and list the templates or pages you want to serve.
If you have to go this way, let me know and I can offer some code from my classes.
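As a rough idea of what application-level matching with named captures could look like, here is a Python sketch (not the classes mentioned above, and not Apache configuration). ROUTE and rewrite are hypothetical names; the group names and sample URLs come from the question.

```python
import re
from urllib.parse import urlencode

# Named groups for the three values; "page-" is optional and not captured.
ROUTE = re.compile(r"^/(?P<module>\w+)/(?P<action>\w+)/(?:page-)?(?P<pagenum>\d+)$")

def rewrite(path):
    """Map /users/list/page-2 or /users/list/2 to the framework query string."""
    m = ROUTE.match(path)
    if m is None:
        return None
    # Build the query in a fixed order from the named groups.
    params = [(name, m.group(name)) for name in ("module", "action", "pagenum")]
    return "/?" + urlencode(params)

print(rewrite("/users/list/page-2"))
print(rewrite("/users/list/2"))
```

Both incoming URL shapes end up at the same endpoint, and the optional prefix no longer shifts any group numbers.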
No named backreferences, it seems, after looking into the mod_rewrite source.
I'd recommend using the RewriteMap option anyway instead of a long list of RewriteRules, as it will be much faster than iterating through a lengthy list.

RegEx match replace help

I am trying to do a regex match and replace for an .htaccess file but I can't quite figure out the replace bit. I think I have the match part right, but maybe someone can help.
I have this url-
http://www.foo.com/MountainCommunities/Lifestyles/5/VacationHomeRentals.aspx
And I'm trying to turn it into this-
http://www.foo.com/mountain-lifestyle/Vacation-Home-Rentals.aspx
(MountainCommunities/Lifestyles)/\d/(.*)(.aspx)
and then I figured I would have a rewrite rule starting like this-
mountain-lifestyle/$2$3
but I need to take what is in $2 in this instance and rewrite it to place dashes between the words with capital letters. Now I'm stumped.
I think you'll have to do it in two bits... Take out $2, precede every capital (apart from the first) with a -, then just append the result to http://www.foo.com/mountain-lifestyle/ with a .aspx on the end.
Try this:
RewriteRule ^(([A-Z][a-z]+-)*)([A-Z][a-z]+)(([A-Z][a-z]+)+)(\.aspx)?$ /$1$3-$4 [N]
RewriteRule ^([A-Z][a-z]+-)+[A-Z][a-z]+$ /$0.aspx [R=301]
Note that mod_rewrite uses an internal counter to detect and avoid infinite loops, so your URL must not contain too many words that need converting (see the MaxRedirects option for Apache < 2.1 and the LimitInternalRecursion directive for Apache ≥ 2.1).
I don't think what you're doing with the capital letters is possible with regex...
You would be better keeping the dashes in the URL and removing the .aspx
eg: http://www.foo.com/MountainCommunities/Lifestyles/5/Vacation-Home-Rentals
This would require the following rule:
^/MountainCommunities/Lifestyles/5/([^/]+)/\?([^/]+) /mountain-lifestyle/$1.aspx?$2 [I]
This also takes into account any querystrings that are sent to the page.
BTW: How are you using .htaccess with IIS?
You can use the regular expression "([A-Z])" on the middle bit "VacationHome", replacing with the regex "-$1" - This will give you "-Vacation-Home-Rentals" - Then you can just chop off the first character, and stick the first part of the URL on the front, and .aspx on the end.
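The replace-then-chop idea, sketched in Python rather than in a rewrite rule. The function names dasherize and rewrite are made up, and the sketch assumes the file name starts with a capital letter (otherwise chopping the first character would remove a real letter).

```python
import re

def dasherize(name):
    """Put a dash before every capital, then drop the leading dash."""
    return re.sub(r"([A-Z])", r"-\1", name)[1:]

def rewrite(path):
    """Turn the old path into the new one, or return None if it doesn't match."""
    m = re.fullmatch(r"/MountainCommunities/Lifestyles/\d+/(.*)\.aspx", path)
    if m is None:
        return None
    return "/mountain-lifestyle/" + dasherize(m.group(1)) + ".aspx"

print(rewrite("/MountainCommunities/Lifestyles/5/VacationHomeRentals.aspx"))
```

Unlike a single fixed rule, this works for any number of camel-cased words in the file name.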
I think the main regex has been written by others, but here is one that matches the request name and places the dashes (assuming every file name is a three-word camel-cased name à la 'VacationHomeRentals.aspx'):
RewriteRule ^/MountainCommunities/Lifestyles/\d+/([A-Z][a-z]+)([A-Z][a-z]+)([A-Z][a-z]+)\.aspx$ /mountain-lifestyle/$1-$2-$3.aspx
This is a restricted version of @Gumbo's response, as I have not had a chance to test his recursion. The recursion technique is definitely the best and most usable for any scenario.
I don't think I quite understand what you are trying to do. Why can't you simply search for:
http://www.foo.com/MountainCommunities/Lifestyles/5/VacationHomeRentals.aspx
and replace it with:
http://www.foo.com/mountain-lifestyle/Vacation-Home-Rentals.aspx ?
Or is this a specific example of a pattern you are trying to transform?