More efficient RewriteRule for messy URL - regex

I have developed a new web site to replace an existing one for a client. Their previous site had some pretty nasty looking URLs to their products. For example, an old URL:
http://mydomain.com/p/-3-0-Some-Ugly-Product-Info-With-1-3pt-/a-arbitrary-folder/-18pt/-1-8pt-/ABC1234
I want to catch all requests to the new site that use these old URLs. The information I need out of the old URL is the ABC1234 which is the product ID. To clarify, the old URL begins with /p/ followed by four levels of folders, then the product ID.
So for example, the above URL needs to get rewritten to this:
http://mydomain.com/shop/?sku=ABC1234
I'm using Apache 2.2 on Linux. Can anyone point me to the correct pattern to match? I know this is wrong, but here is where I am currently:
RewriteRule ^p/([A-Za-z0-9-]+)/([A-Za-z0-9-]+)/([A-Za-z0-9-]+)/([A-Za-z0-9-]+)/([A-Za-z0-9-]+)?$ shop/?sku=$5 [R=301,NC,L]
I'm pretty sure that the pattern used to match each of the 4 folders is redundant, but I'm just not that sharp with regex. I've tried some online regex evaluators with no success.
Thank you.
--EDIT #1--
Actually, my RewriteRule above does work, but is there a way to shorten it up?
--EDIT #2--
Thanks to ddr, I've been able to get this expression down to this:
RewriteRule ^p/([\w-]+/){4}([\w-]+)$ shop/?_sku=$2 [R=301,NC,L]
--EDIT #3--
Mostly for the benefit of ddr, but I welcome anyone who can assist. I'm dealing with over 10,000 URLs that need to be rewritten to work with a new site. The information I've provided so far still stands, but now that I am testing that all of the old URLs are rewritten properly, I am running into a few anomalies that don't work with the RewriteRule example provided by ddr.
The old URLs are consistent in that the product ID I need is at the very end of the URL as documented above. The first folder is always /p/. The problem I am running into at this point is that some of the URLs have a URL encoded double quote ("). Additionally, some of the URLs contain a /-/ as one of the four folders mentioned. So here are some examples of the variations in the old URLs:
/p/-letters-numbers-hyphens-88/another-folder/-and-another-/another-18/ABC1234
/p/-letters-numbers-hyphens-88/2%22/-/-/ABCD1234
/p/letters-numbers-hyphens-1234/34-88/-22/-/ABCD1234/
Though the old URLs are nasty, I think it is safe to say that the following are always true:
Each begins with /p/
Each ends with the product ID that I need to isolate.
There are always four levels of folders between /p/ and the product ID.
Some folders in between have hyphens, some don't.
Some folders in between are a hyphen ONLY.
Some folders in between contain a % character where they are URL encoded.
Some requests include a / at the very end and some do not.
The following rule was provided by ddr and worked great until I ran into the URLs that contain a % percent sign or a folder with only a hyphen:
RewriteRule ^p/(?:[\w-]+/){4}([\w-]+)$ shop/?_sku=$1 [R=301,NC,L]
Given the rule above, how do I edit it to allow for a folder that is hyphen only (/-/) or for a percent sign?

You can use character classes to reduce some of the length. The parentheses (capture groups) are also unnecessary, except the last one, as @jpmc26 says.
I'm not especially familiar with Apache rules, but try this:
RewriteRule ^p/(?:[\w-]+/){4}([\w-]+)$ shop/?sku=$1 [R=301,NC,L]
It should work as written, because Apache 2.x uses PCRE for RewriteRule patterns.
\w is equivalent to [A-Za-z0-9_], and it does no harm to also match underscores, so that's one replacement.
The {4} matches exactly four repetitions of the previous group. PCRE supports this, so Apache 2.2 should accept it.
The ?: is optional but indicates that these parens should not be treated as a capture. Makes it slightly more efficient.
I'm not sure what the part in [] at the end is for but I left it. I can't see why you'd need a ? before the $, so I took it out.
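Edit: the most compact way, if Apache likes it, would probably be the following. A repeated group keeps only its last repetition, so the product ID ends up in $1 (there is no $5):
RewriteRule ^p(?:/([\w-]+)){5}$ shop/?sku=$1 [R=301,NC,L]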
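To check which back-reference actually holds the product ID, here is a minimal sketch using Python's re module as a stand-in for the PCRE engine Apache uses. That substitution is an assumption; the two engines should agree for simple patterns like these. It suggests that a repeated capture group keeps only its final repetition.

```python
import re

# Path as RewriteRule would see it in .htaccess context (no leading slash).
path = "p/-3-0-Some-Ugly-Product-Info-With-1-3pt-/a-arbitrary-folder/-18pt/-1-8pt-/ABC1234"

# One capture group repeated five times: there is still only ONE group,
# and it holds the last repetition, including the leading slash.
repeated = re.compile(r"^p(/[\w-]+){5}$", re.IGNORECASE)
group_count = repeated.groups
last_repetition = repeated.match(path).group(1)

# Moving the capture inside a non-capturing repeat drops the slash,
# so group 1 is the bare product ID.
inner = re.compile(r"^p(?:/([\w-]+)){5}$", re.IGNORECASE)
sku = inner.match(path).group(1)
```

With the first pattern the back-reference would carry a leading slash into the query string, which is why the capture is better placed inside the repeated group.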
EDIT: response to edit 3 of the question.
I'm surprised it doesn't work with only -. The [\w-]+ should match even where there is just a single -. Are you sure there isn't something else going on in these URLs?
You might also try replacing - in the regex with \-.
As for the %, just change [\w-] to [\w%-]. Make sure you leave the - at the end! Otherwise the regex engine will try to interpret it as part of a character range.
EDIT 2: Or try this:
RewriteRule ^p/(?:.*?/){4}(.*?)/?$ shop/?sku=$1 [R=301,NC,L]
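The variations listed in the question can be tried against these patterns before touching the server. The sketch below uses Python's re module as a stand-in for PCRE (an assumption). The LOOSE pattern swaps .*? for [^/]+, which is my own variation rather than ddr's rule. The "decoded" sample is there because mod_rewrite normally matches against the percent-decoded path, so %22 may well arrive as a literal double quote.

```python
import re

# ddr's original rule, and a looser variant that accepts any non-slash
# characters per folder plus an optional trailing slash.
STRICT = re.compile(r"^p/(?:[\w-]+/){4}([\w-]+)$", re.IGNORECASE)
LOOSE = re.compile(r"^p/(?:[^/]+/){4}([^/]+)/?$")

def extract_sku(pattern, path):
    """Return the captured product ID, or None when the path does not match."""
    m = pattern.match(path)
    return m.group(1) if m else None

# Sample paths based on the question, without the leading slash.
plain = "p/-letters-numbers-hyphens-88/another-folder/-and-another-/another-18/ABC1234"
encoded = "p/-letters-numbers-hyphens-88/2%22/-/-/ABCD1234"
decoded = 'p/-letters-numbers-hyphens-88/2"/-/-/ABCD1234'
trailing = "p/letters-numbers-hyphens-1234/34-88/-22/-/ABCD1234/"
hyphen_only = "p/abc/-/-/def/ABC1234"  # hypothetical path with only \w and - characters

for sample in (plain, encoded, decoded, trailing, hyphen_only):
    print(sample, extract_sku(STRICT, sample), extract_sku(LOOSE, sample))
```

In this sketch the strict pattern does accept hyphen-only folders, as ddr expected. It fails on the samples that contain a % or " character, or that end with a trailing slash.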

How to only show id value on url path with htaccess?

What I have right now is
https://www.example.com/link.php?link=48k4E8jrdh
What I want to accomplish is to get this URL instead:
https://www.example.com/48k4E8jrdh
I looked on the internet but with no success :(
Could someone help me and explain how this works?
This is what I have right now (Am I in the right direction?)
RewriteEngine On
RewriteRule ^([^/]*)$ /link.php?link=$1
RewriteRule ^([^/]*)$ /link.php?link=$1
This is close, except that it will also match /link.php (the URL being rewritten to) so will result in an endless rewrite-loop (500 Internal Server Error response back to the browser).
You could avoid this loop by simply making the regex more restrictive. Instead of matching anything except a slash (ie. [^/]), you could match anything except a slash and a dot, so it won't match the dot in link.php, and any other static resources for that matter.
For example:
RewriteRule ^([^/.]*)$ link.php?link=$1 [L]
You should include the L flag if this is intended to be the last rule. Strictly speaking you don't need it if it is already the last rule, but otherwise if you add more directives you'll need to remember to add it!
If the id in the URL should only consist of lowercase letters and digits, as in your example, then consider just matching what is needed (eg. [a-z0-9]). Generally, the regex should be as restrictive as required. Also, how many characters are you expecting? Currently you allow from nothing to "unlimited" length.
Just in case it's not clear, you do still need to change the actual URLs you are linking to in your application to be of the canonical form. ie. https://www.example.com/48k4E8jrdh.
UPDATE:
It works but now the site always sees that page regardless if it is link.php or not? So what happens now is this: example.com/idu33d3dh#dj3d3j And if I just do this: example.com then it keeps coming back to link.php
This is because the regex ^([^/.]*)$ matches 0 or more characters (denoted by the * quantifier). You probably want to match at least one (or some minimum) of character(s)? For example, to match between 1 and 32 characters change the quantifier from * to {1,32}. ie. ^([^/.]{1,32})$.
Incidentally, the fragment identifier (fragid) (ie. everything after the #) is not passed to the server so this does not affect the regex used (server-side). The fragid is only used by client-side code (JavaScript and HTML) so is not strictly part of the link value.
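The difference between the three patterns discussed here can be checked quickly. This is a minimal sketch using Python's re module as a stand-in for Apache's PCRE (an assumption); the helper name matches() is mine.

```python
import re

ANY_NON_SLASH = re.compile(r"^([^/]*)$")   # original attempt, also matches link.php
NO_DOT = re.compile(r"^([^/.]*)$")         # excludes anything containing a dot
BOUNDED = re.compile(r"^([^/.]{1,32})$")   # additionally excludes the empty path

def matches(pattern, path):
    """True when the URL-path (without leading slash) matches the pattern."""
    return pattern.match(path) is not None

for path in ("48k4E8jrdh", "link.php", ""):
    print(repr(path), matches(ANY_NON_SLASH, path), matches(NO_DOT, path), matches(BOUNDED, path))
```

The empty string corresponds to a request for the site root, which is why the * quantifier sent example.com to link.php.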

htaccess - redirect string containing part of specific string

UPDATED - my initial question wasn't quite correct. (apologies to all concerned)
UPDATED again - (this is not my day today..)
I need to redirect all incoming image requests for:
http://www.example.com/images/asd12catalog.jpg (there is an additional alpha character)
To:
http://www.example.com/images/as-d12.jpg (I have added the "-")
So I need to strip out the word catalog and change the first portion of the filename to add a "-" making as-d12.jpg.
I have tried variations on:
RewriteRule ^/images/[a-z0-9]catalog.jpg$ /images/$1.jpg
But I just can't seem to get a match.
Can anyone help please?
Thanks.
Your attempt was very close, the only major problem being that you did not actually wrap anything in your regex as a capture group. By placing parentheses around each part of the filename, as below, the first two letters, the third letter and the digits become available as $1, $2 and $3 after the match. You can then use these as you expected in your redirect URL.
RewriteRule ^/images/([a-zA-Z]{2})([a-zA-Z])([0-9]*)catalog\.jpg$ /images/$1-$2$3.jpg
Demo:
Regex101
RewriteRule ^/?images/([a-zA-Z]{2})([a-zA-Z])([0-9]+)catalog\.jpg$ /images/$1-$2$3.jpg
You're not specific about the exact format of your filenames, but this will match two letters, a third letter and one or more digits followed by catalog.jpg, which will hopefully cover any requirements.
Also note that the leading / should at most be optional when matching in rewrite rules - they haven't been part of the path parsed by RewriteRule since version 1. See https://webmasters.stackexchange.com/questions/27118/when-is-the-leading-slash-needed-in-mod-rewrite-patterns
Edit: updated again for new requirement
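The substitution can be tried outside Apache first. This is a rough sketch using Python's re.sub as a stand-in for the RewriteRule (an assumption; Apache writes back-references as $1 where Python uses \1). The function name rewrite() is hypothetical.

```python
import re

# Two letters, one letter, one or more digits, then the literal "catalog.jpg".
# The leading slash is optional so the pattern works with or without it.
CATALOG = re.compile(r"^/?images/([a-zA-Z]{2})([a-zA-Z])([0-9]+)catalog\.jpg$")

def rewrite(path):
    """Return the rewritten path, or the input unchanged if it does not match."""
    return CATALOG.sub(r"/images/\1-\2\3.jpg", path)

print(rewrite("images/asd12catalog.jpg"))
print(rewrite("images/logo.jpg"))
```

Escaping the dot in catalog\.jpg keeps the pattern from matching names such as catalogXjpg.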

regex to remove the querystring and match the remaining segments from the url

npinti helped me create a regex to remove the query string and match the remaining segments from the URL /seattle/restaurant/sushi?page=2: "Something like so should yield 3 groups: /(.*?)/(restaurant)/([^?]+).*. Group 1 being seattle, group 2 being restaurant and group 3 being sushi. If after the last / there is a ?, the regex discards the ? and everything which follows."
I have tried modifying the above to do the same trick on the URL /seattle/restaurant?page=2 but I could not get it right. I don't know whether there will be a query string or not, or what its parameters will be. So I need the flexibility of the regex above, which will match and discard the ? and everything which follows.
Your rewrite rules may look like:
RewriteRule /([^/]+)/restaurant/([^/]+)$ mynewpage.php?group1=$1&group2=$2 [QSA,NC,L]
You can look up what QSA, NC, and L mean in the links I provide below.
I'm sorry, but your question sounds a lot like "I'm not very good, so can someone do the job for me?". I mean, just look around, you'll find a lot of answers, just get your hands dirty.
Here's the wiki of serverfault.com
The howto's htaccess official guide
The official mod_rewrite guide
And if that's not enough:
Two hints:
If you're not in a hosted environment (= if it's your own server and you can modify the virtual hosts, not only the .htaccess files), try to use the RewriteLog directive: it helps you to track down such problems:
# Trace:
# (!) file gets big quickly, remove in prod environments:
RewriteLog "/web/logs/mywebsite.rewrite.log"
RewriteLogLevel 9
RewriteEngine On
My favorite tool to check for regexp:
http://www.quanetic.com/Regex (choose preg (PCRE), not ereg (POSIX): Apache 2.x matches RewriteRule patterns with PCRE)
This should allow you to match all text prior to the ? and nothing else. It will match everything if no ? is present:
[^?]*
Is that all you need to do? Because /(.*?)/(restaurant)/([^?]+).*. looks like it's designed to do something significantly more complicated.
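One way to handle both URL shapes is to make the third segment and the query string optional. The pattern below is my own sketch, not npinti's regex, written for Python's re module; the function names are hypothetical.

```python
import re

# Segment 1, the literal "restaurant", an optional third segment,
# and an optional query string that is matched but not captured.
SEGMENTS = re.compile(r"^/([^/?]+)/(restaurant)(?:/([^/?]+))?(?:\?.*)?$")

def split_url(url):
    """Return the captured segments as a tuple, or None when there is no match."""
    m = SEGMENTS.match(url)
    return m.groups() if m else None

def strip_query(url):
    """Return everything before the first ?, or the whole URL if there is none."""
    return re.match(r"[^?]*", url).group(0)

print(split_url("/seattle/restaurant/sushi?page=2"))
print(split_url("/seattle/restaurant?page=2"))
print(strip_query("/seattle/restaurant?page=2"))
```

When the third segment is absent, its group comes back as None, so the calling code can tell the two URL shapes apart.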

causing headache in RewriteRule

I am struggling with a very basic regex problem in my .htaccess file that I hope someone may be able to shed some light on. The basic premise is that I would like to teach Apache to switch any .html extension into a .var extension. I had thought that the rule would be positively trivial:
RewriteRule ^([^.]+)\.html$ $1.var
But the [^.] part simply doesn't work. Bizarrely, it works like so
RewriteRule ^([^A-Z]+)\.html$ $1.var
I do not understand why this latter rule works. Assume I am looking for a file called "index.html" then $1 should match to "index." and the ".html" bit should actually fail to match.
To widen the scope of the question slightly, I am actually racking my brain on how to implement a multi-lingual site. I don't like Apache's MultiView option because it forces upon me a flat directory structure with file extensions that aren't recognizable to many development tools. I could go the .var type-map route but am finding that the default config for Apache doesn't support this all that well either (hence my excursions into regex land). So while I am using mod_rewrite, I am thinking that I might go the whole hog: whenever a request for a name.html file is received and this file does not exist, check whether there exists a XX/name.html file instead, where "XX" is the language code according to the user's preferences.
This would give me a neater directory structure, though it perhaps does not perform as well as the .var approach in a situation where the language preference of the user's browser is not supported by my site (in which situation .var would substitute EN or similar).
Any thoughts? Thanks.
Why don't you just use ^(.*)\.html$? This will match any string that ends in .html. After all, filenames can contain more than one dot.
[^A-Z]+ matches index if the regex is applied case-sensitively. Perhaps that's the reason? Why [^.]+ should fail is beyond me, though.
Outside a character class, the . matches everything but newlines.
Inside a character class, a leading ^ means "not", and the . loses its special meaning: it is just a literal dot.
The + means one or more of the preceding character class.
So when you write ([^.]+), that says "match one or more characters that are not a dot". For index.html it matches index, so the pattern itself only fails when the part before ".html" contains another dot, for example index.en.html or a directory name with a dot in it.
^([^A-Z]+)\.html$ works because it matches one or more characters that are not uppercase letters, and that includes dots. If you have any uppercase letters before the ".html" in your URL, this one will fail too.
Tim Pietzcker's suggestion is correct: just use ^(.*)\.html$, keeping in mind that this won't work in the odd case that you have newlines in your URL.
In the odd case that you actually have URLs with newlines in them, you can use ^([\d\D]+)\.html$, which will match digits and non-digits (i.e. everything) up until the ".html".
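How the three patterns treat dots and uppercase letters can be checked directly. This is a minimal sketch with Python's re module standing in for Apache's PCRE (an assumption); to_var() is a hypothetical helper that mimics the $1.var substitution.

```python
import re

NO_DOTS = re.compile(r"^([^.]+)\.html$")    # [^.] is "any character except a literal dot"
NO_UPPER = re.compile(r"^([^A-Z]+)\.html$")  # dots allowed, uppercase letters are not
ANYTHING = re.compile(r"^(.*)\.html$")       # anything up to the final .html

def to_var(pattern, path):
    """Swap .html for .var when the pattern matches, else return None."""
    m = pattern.match(path)
    return m.group(1) + ".var" if m else None

for path in ("index.html", "index.en.html", "Index.html"):
    print(path, to_var(NO_DOTS, path), to_var(NO_UPPER, path), to_var(ANYTHING, path))
```

In this sketch [^.]+ handles index.html and rejects only paths with an extra dot before the extension, so a failure in Apache points at the requested path (or another rule) rather than at the character class.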

Regex to match a URL pattern for a htaccess file

I'm really struggling with this regex.
We're launching a new version of our "groups" feature, but need to support all the old groups at the same URL. Luckily, these two features have different url patterns, so this should be easy? Right?
Old: domain.com/group/$groupname
New: domain.com/group/$groupid/$groupname
The current rule I was using to redirect old groups to the old script was: RewriteRule ^(group/([^0-9/].*))$ /old_group.php/$1 [L,QSA]
It was working great until a group popped up called something like: domain.com/group/2009-neato-group
At which point the regex matched and forced the new group to the old code because it contained numbers. Ooops.
So, any thoughts on a regex that would match the old pattern so I can redirect it?
Patterns it must match:
group/my-old-group
group/another-old-group/
group/123-another-old-group-456/a-module-inside-the-group/
Patterns it cannot match:
group/12/a-new-group
group/345/another-new-group/
group/6789/a-3rd-group/module-abc/
The best way I've thought about it is that it cannot start with "group-slash-numbers-slash", but haven't been able to turn that into a regex. So, something to negate that pattern would probably work.
Thanks
Think the other way around: if the URL matches ^group/[0-9]+/.+, then it is a new group, so redirect these first (with an [L] rule). Everything else must be an old group, so you can keep your old rules for those.
Your design choices make a pure-htaccess solution difficult, especially since group/123-another-old-group-456/a-module-inside-the-group/ is a valid old-style URL.
I'd suggest a redirector script that looks at its arguments and determines whether the first part is a valid groupid. If so, it's a new-style URL, so take the user to that group. Otherwise it is an old-style URL, so hand it off to old_group.php. Something like this:
RewriteRule ^group/(.*)$ /redirector.php/$1 [L,QSA]
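The "new groups first" ordering, and the "cannot start with group-slash-numbers-slash" idea from the question, can both be sketched as follows. The lookahead pattern is my own suggestion rather than something from the answers above, and Python's re module stands in for Apache's PCRE, which also supports negative lookahead. The function name classify() is hypothetical.

```python
import re

# New-style URLs: group/<digits>/<something>.
NEW_GROUP = re.compile(r"^group/[0-9]+/.+")
# Old-style URLs: group/ NOT followed by digits and a slash.
OLD_GROUP = re.compile(r"^group/(?![0-9]+/)(.*)$")

def classify(path):
    """Check the new-style pattern first; anything else under group/ is old."""
    if NEW_GROUP.match(path):
        return "new"
    if path.startswith("group/"):
        return "old"
    return None

must_match = [
    "group/my-old-group",
    "group/another-old-group/",
    "group/123-another-old-group-456/a-module-inside-the-group/",
]
cannot_match = [
    "group/12/a-new-group",
    "group/345/another-new-group/",
    "group/6789/a-3rd-group/module-abc/",
]
```

Both approaches agree on the six sample paths from the question; which one to use depends on whether you prefer two ordered rules or a single pattern.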