RegEx match replace help - regex

I am trying to do a regex match and replace for an .htaccess file but I can't quite figure out the replace bit. I think I have the match part right, but maybe someone can help.
I have this url-
http://www.foo.com/MountainCommunities/Lifestyles/5/VacationHomeRentals.aspx
And I'm trying to turn it into this-
http://www.foo.com/mountain-lifestyle/Vacation-Home-Rentals.aspx
(MountainCommunities/Lifestyles)/\d/(.*)(.aspx)
and then I figured I would have a rewrite rule starting like this-
mountain-lifestyle/$2$3
but I need to take what is in $2 in this instance and rewrite it to place dashes between the words with capital letters. Now I'm stumped.

I think you'll have to do it in two bits... Take out $2, precede every capital (apart from the first) with a -, then just append the result to http://www.foo.com/mountain-lifestyle/ with a .aspx on the end.

Try this:
RewriteRule ^(([A-Z][a-z]+-)*)([A-Z][a-z]+)(([A-Z][a-z]+)+)(\.aspx)?$ /$1$3-$4 [N]
RewriteRule ^([A-Z][a-z]+-)+[A-Z][a-z]+$ /$0.aspx [R=301]
Note that mod_rewrite uses an internal counter to detect and avoid infinite loops, so your URL may not contain too many words that need converting (see the MaxRedirects option for Apache < 2.1 and the LimitInternalRecursion directive for Apache ≥ 2.1).
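To see how the repeated application works, here is a rough Python simulation of the two rules. It is only a sketch: the function name dash_words and the max_rounds limit are made up for illustration, and the loop merely stands in for what the [N] flag does inside Apache.

```python
import re

# Pattern of the first rule, copied as-is (the syntax is the same in Python).
SPLIT = re.compile(r'^(([A-Z][a-z]+-)*)([A-Z][a-z]+)(([A-Z][a-z]+)+)(\.aspx)?$')
# Pattern of the second rule: only dash-separated capitalised words remain.
DONE = re.compile(r'^([A-Z][a-z]+-)+[A-Z][a-z]+$')

def dash_words(name, max_rounds=10):
    """Re-apply the splitting rule until it stops matching, as [N] would."""
    for _ in range(max_rounds):  # hypothetical stand-in for Apache's loop limit
        m = SPLIT.match(name)
        if m is None:
            break
        # Equivalent of the substitution $1$3-$4 (the .aspx group is dropped).
        name = m.group(1) + m.group(3) + '-' + m.group(4)
    if DONE.match(name):
        name += '.aspx'  # equivalent of the second rule's $0.aspx
    return name

print(dash_words('VacationHomeRentals.aspx'))  # Vacation-Home-Rentals.aspx
```

Each round splits off one more word, so a name with many words needs many rounds, which is where the recursion limit comes in.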

I don't think what you're doing with the capital letters is possible with regex...
You would be better keeping the dashes in the URL and removing the .aspx
eg: http://www.foo.com/MountainCommunities/Lifestyles/5/Vacation-Home-Rentals
This would require the following rule:
^/MountainCommunities/Lifestyles/5/([^/]+)/\?([^/]+) /mountain-lifestyle/$1.aspx?$2 [I]
This also takes into account any querystrings that are sent to the page.
BTW: How are you using .htaccess with IIS?

You can use the regular expression "([A-Z])" on the middle bit "VacationHomeRentals", replacing each match with "-$1". This will give you "-Vacation-Home-Rentals". Then you can just chop off the first character, stick the first part of the URL on the front, and put .aspx on the end.
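Outside of .htaccess, for example in a script that generates a redirect map, that idea could look something like this in Python. The helper name camel_to_dashed is made up for the example; note that Python writes the backreference as \1 rather than $1.

```python
import re

def camel_to_dashed(name):
    """Insert '-' before each capital letter, then drop the leading dash."""
    dashed = re.sub(r'([A-Z])', r'-\1', name)  # '-Vacation-Home-Rentals'
    return dashed[1:] if dashed.startswith('-') else dashed

old = 'VacationHomeRentals'
new_url = 'http://www.foo.com/mountain-lifestyle/' + camel_to_dashed(old) + '.aspx'
print(new_url)  # http://www.foo.com/mountain-lifestyle/Vacation-Home-Rentals.aspx
```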

I think the main regex has been written by others, but here is one that matches the request name and places the dashes (assuming all the file names are three camel-cased words, à la 'VacationHomeRentals.aspx'):
RewriteRule ^/MountainCommunities/Lifestyles/\d+/([A-Z][a-z]+)([A-Z][a-z]+)([A-Z][a-z]+)\.aspx$ /mountain-lifestyle/$1-$2-$3.aspx
This is a restricted version of @Gumbo's response, as I have not had a chance to test his recursion. The recursion technique is definitely the best and most usable for any scenario.

I don't think I quite understand what you are trying to do. Why can't you simply search for:
http://www.foo.com/MountainCommunities/Lifestyles/5/VacationHomeRentals.aspx
and replace it with:
http://www.foo.com/mountain-lifestyle/Vacation-Home-Rentals.aspx ?
Or is this a specific example of a pattern you are trying to transform?

Related

how to consider regex expression as block to apply lookbehinds and lookaheads?

I'm trying to turn a string of this type:
http://example.com/mypage/272-16+276-63+350-02
where each aaa-bb is a product code and the number of codes may vary from 2 upward (though I doubt there will ever be more than 5), into:
http://example.com/mypage/272-16+276-63+350-02/?skus=272-16+276-63+350-02
using a redirect match. I'm fairly new to regular expressions and I don't seem to be able to get the negative lookahead and lookbehind to work the way I want.
Capturing the string the first time is fairly easy: I used ([\-\+0-9]+), but I don't want it to match on redirection (when I already have a ? in my link). Using ([\-\+0-9]+)(?!\?)(?<\?) doesn't do the trick; it only excludes my last digit from the match. So, is there a way to make regex consider all my product codes as one block, so I can then check if there is a question mark before or after it?
Thank you for looking into this.
You can't mix mod_rewrite (RewriteCond) and mod_alias (RedirectMatch) together. You need to stick with one or the other and you can't match the query string with a RedirectMatch, so you're using mod_rewrite:
RewriteEngine On
RewriteCond %{QUERY_STRING} !skus=
RewriteRule ^mypage/([\-\+0-9]+)$ /mypage/$1?skus=$1 [L,R=301]
Maybe try (?<=http://example.com/mypage/)[0-9+-]+$ which will match only the first case.
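The "only excludes my last digit" symptom comes from backtracking: when the lookahead fails, the engine gives back one character and tries again. A small Python sketch shows this, followed by a rough imitation of the RewriteCond/RewriteRule pair. The function should_redirect is hypothetical and only mimics the logic; it is not how Apache evaluates the rules internally.

```python
import re

# The asker's idea: codes not followed by '?'. The + simply backs off by one.
attempt = re.compile(r'([-+0-9]+)(?!\?)')
url = 'mypage/272-16+276-63+350-02?skus=272-16+276-63+350-02'
print(attempt.search(url).group(1))  # 272-16+276-63+350-0

# Anchoring both ends forces the codes to be taken as one block.
anchored = re.compile(r'^mypage/([-+0-9]+)$')

def should_redirect(path, query):
    """Path must match as a whole; the query string must not contain skus=."""
    m = anchored.match(path)
    if m and 'skus=' not in query:
        return '/mypage/%s?skus=%s' % (m.group(1), m.group(1))
    return None

print(should_redirect('mypage/272-16+276-63+350-02', ''))
print(should_redirect('mypage/272-16+276-63+350-02', 'skus=272-16+276-63+350-02'))
```

The second call returns None, which corresponds to the rule not firing once the query string is already present.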

More efficient RewriteRule for messy URL

I have developed a new web site to replace an existing one for a client. Their previous site had some pretty nasty looking URLs to their products. For example, an old URL:
http://mydomain.com/p/-3-0-Some-Ugly-Product-Info-With-1-3pt-/a-arbitrary-folder/-18pt/-1-8pt-/ABC1234
I want to catch all requests to the new site that use these old URLs. The information I need out of the old URL is the ABC1234 which is the product ID. To clarify, the old URL begins with /p/ followed by four levels of folders, then the product ID.
So for example, the above URL needs to get rewritten to this:
http://mydomain.com/shop/?sku=ABC1234
I'm using Apache 2.2 on Linux. Can anyone point me on the correct pattern to match? I know this is wrong, but here is where I am currently at:
RewriteRule ^p/([A-Za-z0-9-]+)/([A-Za-z0-9-]+)/([A-Za-z0-9-]+)/([A-Za-z0-9-]+)/([A-Za-z0-9-]+)?$ shop/?sku=$5 [R=301,NC,L]
I'm pretty sure that the pattern used to match each of the 4 folders is redundant, but I'm just not that sharp with regex. I've tried some online regex evaluators with no success.
Thank you.
--EDIT #1--
Actually, my RewriteRule above does work, but is there a way to shorten it up?
--EDIT #2--
Thanks to ddr, I've been able to get this expression down to this:
RewriteRule ^p/([\w-]+/){4}([\w-]+)$ shop/?_sku=$2 [R=301,NC,L]
--EDIT #3--
Mostly for the benefit of ddr, but I welcome anyone to assist who can. I'm dealing with over 10,000 URLs that need to be rewritten to work with a new site. The information I've provided so far still stands, but now that I am testing that all of the old URLs are being rewritten properly I am running into a few anomalies that don't work with the RewriteRule example provided by ddr.
The old URLs are consistent in that the product ID I need is at the very end of the URL as documented above. The first folder is always /p/. The problem I am running into at this point is that some of the URLs have a URL encoded double quote ("). Additionally, some of the URLs contain a /-/ as one of the four folders mentioned. So here are some examples of the variations in the old URLs:
/p/-letters-numbers-hyphens-88/another-folder/-and-another-/another-18/ABC1234
/p/-letters-numbers-hyphens-88/2%22/-/-/ABCD1234
/p/letters-numbers-hyphens-1234/34-88/-22/-/ABCD1234/
Though the old URLs are nasty, I think it is safe to say that the following are always true:
Each begins with /p/
Each ends with the product ID that I need to isolate.
There are always four levels of folders between /p/ and the product ID.
Some folders in between have hyphens, some don't.
Some folders in between are a hyphen ONLY.
Some folders in between contain a % character where they are URL encoded.
Some requests include a / at the very end and some do not.
The following rule was provided by ddr and worked great until I ran into the URLs that contain a % percent sign or a folder with only a hyphen:
RewriteRule ^p/(?:[\w-]+/){4}([\w-]+)$ shop/?_sku=$1 [R=301,NC,L]
Given the rule above, how do I edit it to allow for a folder that is hyphen only (/-/) or for a percent sign?
You can use character classes to reduce some of the length. The parentheses (capture groups) are also unnecessary, except the last one, as @jpmc26 says.
I'm not especially familiar with Apache rules, but try this:
RewriteRule ^p/(?:[\w-]+/){4}([\w-]+)$ shop/?sku=$1 [R=301,NC,L]
It should work if extended regular expressions are supported.
\w is equivalent to [A-Za-z0-9_], and it does no harm to match underscores as well, so that's one replacement.
The {4} matches exactly four repetitions of the previous group. Not every regex flavor supports this, but mod_rewrite uses PCRE, which does.
The ?: is optional but indicates that these parens should not be treated as a capture. Makes it slightly more efficient.
I'm not sure what the part in [] at the end is for but I left it. I can't see why you'd need a ? before the $, so I took it out.
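Edit: a more compact-looking form, ^p(/[\w-]+){5}$ with $5, will not do what you want: a repeated group is still a single capture group that keeps only its last repetition, so $5 is empty and $1 would be /ABC1234 including the slash. The shortest form that still isolates the product ID is probably
RewriteRule ^p(?:/[\w-]+){4}/([\w-]+)$ shop/?sku=$1 [R=301,NC,L]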
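How a quantified capture group behaves can be checked quickly in Python, whose re module treats repeated groups the same way PCRE does in this respect. This is only a sketch; the path p/a/b/c/d/ABC1234 is a made-up stand-in for an old URL.

```python
import re

path = 'p/a/b/c/d/ABC1234'  # hypothetical old URL path

# One pair of parentheses means one group, however often it repeats.
compact = re.match(r'^p(/[\w-]+){5}$', path)
print(compact.groups())  # ('/ABC1234',)  only the last repetition is kept

# Giving the final segment its own group makes the product ID available as group 1.
fixed = re.match(r'^p(?:/[\w-]+){4}/([\w-]+)$', path)
print(fixed.group(1))  # ABC1234
```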
EDIT: response to edit 3 of the question.
I'm surprised it doesn't work with only -. The [\w-]+ should match even where there is just a single -. Are you sure there isn't something else going on in these URLs?
You might also try replacing - in the regex with \-.
As for the %, just change [\w-] to [\w%-]. Make sure you leave the - at the end! Otherwise the regex engine will try to interpret it as part of a character range.
EDIT 2: Or try this:
RewriteRule ^p/(?:.*?/){4}(.*?)/?$ shop/?sku=$1 [R=301,NC,L]
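A slightly stricter variant, offered as a suggestion rather than something tested on the real site, replaces .*? with [^/]+ so that each piece is exactly one folder; the lazy .*? can still cross a / when it has to, so a path with five folders would also match. As far as I know mod_rewrite matches against the %-decoded path, so the encoded quote probably arrives as a literal " instead of %22; the negated class accepts either. The Python sketch below checks the pattern against the three sample paths from the question (without the leading slash, as in .htaccess context). The name new_location is made up.

```python
import re

# Assumed variant: [^/]+ means "one folder", whatever characters it contains.
OLD_URL = re.compile(r'^p/(?:[^/]+/){4}([^/]+)/?$', re.IGNORECASE)

def new_location(path):
    """Return the redirect target for an old product path, or None."""
    m = OLD_URL.match(path)
    return 'shop/?sku=' + m.group(1) if m else None

samples = [
    'p/-letters-numbers-hyphens-88/another-folder/-and-another-/another-18/ABC1234',
    'p/-letters-numbers-hyphens-88/2%22/-/-/ABCD1234',
    'p/letters-numbers-hyphens-1234/34-88/-22/-/ABCD1234/',
]
for path in samples:
    print(new_location(path))
```

The corresponding rule would be RewriteRule ^p/(?:[^/]+/){4}([^/]+)/?$ shop/?sku=$1 [R=301,NC,L].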

Regex for simple urls

I am looking for regex for simple URLs as
http://www.google.com
http://www.yahoo.in
http://www.example.eu
http://www.example.net
etc.
No subdirectories allowed. For example, it must not validate http://www.google.com/ or http://www.yahoo.in/mail.
Does anyone know any regex to do this?
I'm still a noob, but try this:
^http:\/\/[a-zA-Z0-9_\-]+\.[a-zA-Z0-9_\-]+\.[a-zA-Z0-9_\-]+$
This one should do:
^(https?:\/\/)?[0-9a-zA-Z]+\.[-_0-9a-zA-Z]+\.[0-9a-zA-Z]+$
This should work for URLs starting with http:// or https:// or without the protocol name.
The regex should also be used as case-insensitive. In that case, it can be shortened a bit:
^(https?:\/\/)?[0-9a-z]+\.[-_0-9a-z]+\.[0-9a-z]+$
If you don't care whether it is a valid url, you can use:
\S*www\.\S+
All the examples contain www. followed by non-space characters, and that is unlikely to occur in a normal word.
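The stricter case-insensitive pattern can be checked against the URLs from the question with a few lines of Python. This is just a sanity check of the regex as posted (the \/ escapes are not needed in Python, so plain slashes are used); it does not prove the pattern is a complete URL validator.

```python
import re

# Case-insensitive version of the pattern, with the optional protocol part.
SIMPLE_URL = re.compile(r'^(https?://)?[0-9a-z]+\.[-_0-9a-z]+\.[0-9a-z]+$',
                        re.IGNORECASE)

accept = ['http://www.google.com', 'http://www.yahoo.in',
          'http://www.example.eu', 'http://www.example.net']
reject = ['http://www.google.com/', 'http://www.yahoo.in/mail']

for candidate in accept + reject:
    print(candidate, bool(SIMPLE_URL.match(candidate)))
```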

causing headache in RewriteRule

I am struggling with a very basic regex problem in my .htaccess file that I hope someone may be able to shed some light on. The basic premise is that I would like to teach Apache to switch any .html extension into a .var extension. I had thought that the rule would be positively trivial:
RewriteRule ^([^.]+)\.html$ $1.var
But the [^.] part simply doesn't work. Bizarrely, it works like so
RewriteRule ^([^A-Z]+)\.html$ $1.var
I do not understand why this latter rule works. Assume I am looking for a file called "index.html" then $1 should match to "index." and the ".html" bit should actually fail to match.
To widen the scope of the question slightly, I am actually racking my brain on how to implement a multi-lingual site. I don't like Apache's MultiView option because it forces upon me a flat directory structure with file extensions that aren't recognizable to many development tools. I could go the .var type-map route but am finding that the default config for Apache doesn't support this all that well either (hence my excursions into regex land). So while I am using mod_rewrite, I am thinking that I might go the whole hog: whenever a request for a name.html file is received and this file does not exist, check whether there exists a XX/name.html file instead, where "XX" is the language code according to the user's preferences.
This would give me a neater directory structure, though it perhaps does not perform as well as the .var approach in a situation where the language preference of the user's browser is not supported by my site (in which situation .var would substitute EN or similar).
Any thoughts? Thanks.
Why don't you just use ^(.*)\.html$? This will match any string that ends in .html. After all, filenames can contain more than one dot.
[^A-Z]+ matches index if the regex is applied case-sensitively. Perhaps that's the reason? Why [^.]+ should fail is beyond me, though.
Outside a character class, the . matches everything but newlines.
Inside a character class, a leading ^ means "not", and the . loses its special meaning and is just a literal dot.
The + means one or more of the preceding character class.
So when you write ([^.]+), that says "match one or more characters that are not a dot". For index.html that captures index, so the pattern itself is unlikely to be the culprit.
^([^A-Z]+)\.html$ works because it matches one or more characters that are not uppercase letters. If you have any uppercase letters before the ".html" in your URL, this one will fail.
Tim Pietzcker's suggestion is correct: just use ^(.*)\.html$, keeping in mind that this won't work in the odd case that you have newlines in your URL.
In the odd case that you actually have URLs with newlines in them, you can use ^([\d\D]+)\.html$, which will match digits and non-digits (i.e. everything) up until the ".html".
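A quick check in Python, whose character-class rules agree with PCRE on this point, suggests that a dot inside brackets is a literal dot, so [^.]+ means "anything but a dot". The helper name stem and the file names are made up for the example.

```python
import re

def stem(pattern, filename):
    """Return what group 1 captures, or None if the pattern does not match."""
    m = re.match(pattern, filename)
    return m.group(1) if m else None

print(stem(r'^([^.]+)\.html$', 'index.html'))     # index
print(stem(r'^([^.]+)\.html$', 'my.page.html'))   # None (extra dot in the name)
print(stem(r'^(.*)\.html$', 'my.page.html'))      # my.page
print(stem(r'^([^A-Z]+)\.html$', 'index.html'))   # index (after backtracking)
print(stem(r'^([^A-Z]+)\.html$', 'Index.html'))   # None (uppercase letter)
```

So the [^.] version only fails on names with more than one dot, which is the case the ^(.*)\.html$ form handles.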

Regular Expression to match multiple query string parameter/value pairs

About to work through this one, but thought someone may have already had to tackle it, so...
I'm looking for an elegant (and isapi rewrite compatible) regular expression to look for three known parameter/value pairs in a querystring, regardless of order, and also extract all other parameters while stripping out those three.
abc=123, def=456 and ghi=789 are all known, fixed strings. They may appear in any order in the querystring, may or may not be the only parameters, and may or may not be adjacent. The match should be smart and not match aaabc=123 or abc=1234 (so each searched parameter should be bracketed by &, ?, #, or end of string). The output I want is a new query string containing the remaining params, with those three stripped out.
I'll probably be taking a stab at the logic in the morning, so bonus points if you can solve it before then.
I think regexes shouldn't be used for problems of this type. Just tokenize the string, and compare every parameter's name to what you are looking for.
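If the rewrite can be done in application code instead of the rewrite engine, the tokenizing approach might look like this in Python. It is a sketch under that assumption; the function name strip_known is made up, and it returns None when the three known pairs are not all present.

```python
from urllib.parse import parse_qsl, urlencode

# The three known, fixed parameter/value pairs from the question.
KNOWN = {('abc', '123'), ('def', '456'), ('ghi', '789')}

def strip_known(query):
    """Return the query string without the known pairs, or None if any is missing."""
    pairs = parse_qsl(query, keep_blank_values=True)
    if not KNOWN.issubset(pairs):
        return None  # e.g. aaabc=123 or abc=1234 do not count as abc=123
    return urlencode([p for p in pairs if p not in KNOWN])

print(strip_known('x=1&def=456&abc=123&y=2&ghi=789'))  # x=1&y=2
print(strip_known('aaabc=123&def=456&ghi=789'))        # None
```

Because names and values are compared as whole tokens, order and adjacency do not matter and partial matches cannot slip through.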
s/(\?|\#|\&)(abc=123|def=456|ghi=789)(\&|\#|$)//g
This is approximate and untested, but presents a working (I think) concept. Basically, look for starting border, literal string, then end border, replacing each with null, globally, and using | to give alternate options for each.
Here's what I've come up with:
RewriteRule ^/oldpage.htm\?(.*)(?<=\?|&)(?:abc=123&|def=456&|ghi=789&)(.*)(?<=&)(?:abc=123&|def=456&|ghi=789&)(.*)(?<=&)(?:(?:abc=123|def=456|ghi=789)(?:&|#|$))(.*) /newpage.htm?$1$2$3 [I,RP,L]
which I think works. The lookahead/lookbehind qualifiers, (?= and (?<= , seem to be the key to allowing me to look for the enclosing & or ? without "consuming" it and messing up the next match.
One gotcha is that if the old page URL only has the three params, I still end up with a trailing ? with no parameters on the redirected URL, "/newpage.htm?". I'm currently planning to avoid that by using a RewriteCond to only look at URLs with 4+ params before this fires, and to have a simpler match regex for the ones with exactly three, so the full ruleset comes out to:
RewriteCond URL ^/oldpage.htm\?([^#]*=[^#]*&){3,}[^#]*=[^#]*.*
RewriteRule ^/oldpage.htm\?(.*)(?<=\?|&)(?:abc=123&|def=456&|ghi=789&)(.*)(?<=&)(?:abc=123&|def=456&|ghi=789&)(.*)(?<=&)(?:(?:abc=123|def=456|ghi=789)(?:&|#|$))(.*) /newpage.htm?$1$2$3 [I,RP,L]
RewriteRule ^/oldpage.htm\?(?:abc=123|def=456|ghi=789)&(?:abc=123|def=456|ghi=789)&(?:abc=123|def=456|ghi=789)(.*) /newpage.htm$1 [I,RP,L]
(the $1 at the end is for #additions to the url...do I really need it?) The other issue is I suppose a url of /oldpage.htm?abc=123&abc=123&abc=123 would trigger this, but I don't see any easy way around that, and am not too worried about it..
Can anyone think of a better way to approach this, or see any other issues?
There are querystring decoders. There are many connected topics, especially on this site.
Some of them: First, Second, and a javadocs link for the Apache decoder.