How can I canonicalize URLs in my .htaccess? - regex

I have a WordPress installation on a LAMP stack, and if I have a post at http://example.com/abc/ , I would like URLs like http://example.com/abc/def.html to be redirected to http://example.com/abc/ . (Note that the slot occupied here by "def" must not contain any slashes; among other things, this means that URLs under http://example.com/wp-content/ should be unaffected.)
The rewrite I tried is:
RewriteRule ^(/[^/]+/)[^/]+\.html$ $1 [R=301,L]
As far as I can tell, that says, "Take the first two slashes and everything between them, matching on no more slashes and ending in .html, and redirect to the first captured group." However, with that in place, I can access http://example.com/abc/ , but I get a 404 on attempted access to http://example.com/abc/def.html .
What should I be doing to put the desired redirect behavior in place?
Thanks,

Try this rule:
RewriteRule ^/?([^/]+/)[^/.]+\.html$ /$1 [NC,R=301,L]
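This makes the leading slash optional, since the path that RewriteRule matches in .htaccess does not have one, and tightens the part after the first slash so that it cannot contain a dot or another slash. Make sure this is your very first rule.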
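As a rough check of the explanation above, the two patterns can be compared outside Apache. This is only a sketch using Python's re module as a stand-in for mod_rewrite's regex engine; the helper name redirect_target and the sample paths are made up for illustration.

```python
import re

# Pattern from the question (requires a leading slash) and from the answer.
ORIGINAL = re.compile(r"^(/[^/]+/)[^/]+\.html$")
# re.IGNORECASE stands in for the NC flag.
FIXED = re.compile(r"^/?([^/]+/)[^/.]+\.html$", re.IGNORECASE)


def redirect_target(path):
    """Return the redirect target for a path, or None if the rule does not match."""
    m = FIXED.match(path)
    return "/" + m.group(1) if m else None


# In .htaccess the path reaches RewriteRule without its leading slash.
print(ORIGINAL.match("abc/def.html"))                      # no match, rule never fires
print(redirect_target("abc/def.html"))                     # target built from group 1
print(redirect_target("wp-content/themes/x/style.html"))   # deeper paths are left alone
```

The first call shows why the original rule led to a 404: without the leading slash the pattern never matches, so the request falls through to WordPress.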

Related

Using variables as folder names in .htaccess file gives 500 internal server error

When a user requests /geo/anchorage.json from my server, I'm trying to have it provide data from /geo/a/n/c/anchorage.json
I have this rule written in my .htaccess file, but it's causing a 500 internal server error.
RewriteRule ^geo/((.)(.)(.).+)\.json /geo/$2/$3/$4/$1.json [QSA,L]
I've broken down the rule into parts, testing the first part with a php script to output the parameters, and that worked fine.
RewriteRule ^geo/((.)(.)(.).+)\.json /geo/test.php?2=$2&3=$3&4=$4&1=$1 [QSA,L]
It seems like it's the last part that's causing the error, but I can't find what I'm doing wrong. I've verified that /geo/a/n/c/anchorage.json exists on the server. Is there anything special when you use variables as folders?
The resulting URL /geo/a/n/c/anchorage.json also matches the input regex (^geo/((.)(.)(.).+)\.json), so the rule is applied again to its own output and you get a rewrite loop (hence the 500 error). You can avoid the loop by being more specific in your regex, e.g. instead of matching any character (.) you could match anything that is not a slash ([^/]).
In other words, try the following:
RewriteRule ^geo/((.)([^/])([^/])[^/.]+)\.json$ /geo/$2/$3/$4/$1.json [QSA,L]
I left the first capturing group as a . (dot) since that couldn't be a slash anyway.
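The loop can be reproduced outside Apache. The following is a minimal sketch that uses Python's re module to mimic repeated application of the rule; the helper names and the cap of 10 passes are arbitrary choices for this illustration, not Apache settings.

```python
import re

LOOPING = re.compile(r"^geo/((.)(.)(.).+)\.json")
FIXED = re.compile(r"^geo/((.)([^/])([^/])[^/.]+)\.json$")


def rewrite(pattern, path):
    """Apply one pass of the rule; return the new path, or None if it does not match."""
    m = pattern.match(path)
    if not m:
        return None
    return "geo/{}/{}/{}/{}.json".format(m.group(2), m.group(3), m.group(4), m.group(1))


def passes_until_stable(pattern, path, limit=10):
    """Count how often the rule fires before it stops matching (capped at limit)."""
    count = 0
    while count < limit:
        new_path = rewrite(pattern, path)
        if new_path is None:
            break
        path = new_path
        count += 1
    return count


print(passes_until_stable(LOOPING, "geo/anchorage.json"))  # hits the cap: keeps matching
print(passes_until_stable(FIXED, "geo/anchorage.json"))    # fires once, then stops
```

With the stricter pattern the second pass fails at the first slash after "a", so the rewritten path is served instead of being rewritten again.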
You may use this rule to fix your issue:
RewriteRule ^(geo)/((\w)(\w)(\w).*\.json)$ $1/$3/$4/$5/$2 [NC,L]
There is no need for the QSA flag here: the substitution does not contain a query string, so the original query string is passed through unchanged anyway.

.htaccess redirect still goes to 404

I have the following rewrite in my .htaccess file, it is still landing on a 404 instead of redirecting.
RewriteCond %{QUERY_STRING} tab=auto_data(.*)$
RewriteRule ^(.*)$ https://test.example.com/automobile-data/ [L,R=301]
There are multiple pages that can possibly have the tab=auto_data query string parameter, and there are quite possibly other QSP appended behind tab=auto_data as well.
I need to redirect any URL that contains the QSP of tab=auto_data to a new page in the site. The domain would remain the same, just the page name is changing.
What am I doing wrong here?
The only other directives are the standard WordPress directives.
In that case, your external redirect should come before any WordPress routing directives. The RewriteEngine directive only needs to appear once, anywhere, in the file. Although it is obviously more logical if it occurs once at the top.
You also need to remove the query string from the substitution, otherwise you'll get a redirect loop since the domain is the same. If the domain/host remains the same, then the scheme and hostname can be omitted from the substitution.
Try the following:
RewriteCond %{QUERY_STRING} tab=auto_data
RewriteRule ^polk/$ /automobile-data/? [R=301,L]
This specifically matches only the URL-path /polk/ (as mentioned in comments), unless this needs to be more general? And tab=auto_data must match anywhere in the query string.
The ? on the end of the substitution removes the query string and so prevents a redirect loop. (Presumably the query string should be removed from the target?) Although since the source and target URL paths are different, this is not strictly necessary anymore.
If the "domain remains the same", then there is no specific need to include the scheme and host in the substitution. Unless you are hosting multiple domains etc.?
Make sure the browser cache is cleared before testing as 301s are notorious for caching. Testing with 302s can be preferable for this reason.
UPDATE: To specifically remove this query string parameter, but copy the remaining query string onto the target, try something like:
RewriteCond %{QUERY_STRING} ^tab=auto_data(?:&(.+))?
RewriteRule ^polk/$ /automobile-data/?%1 [R=301,L]
(?:&(.+))? - grabs any remaining query string (if any), but excludes the & prefix (param delimiter) from the captured group. %1 is a backreference to this captured group.
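To see what the capture group holds for different query strings, here is a small sketch in Python. The re module stands in for mod_rewrite's regex engine, and the function name target_for and the sample parameters are invented for this example.

```python
import re

# Same pattern as in the RewriteCond above.
QS = re.compile(r"^tab=auto_data(?:&(.+))?")


def target_for(query_string):
    """Return the redirect target, or None if the condition does not match."""
    m = QS.match(query_string)
    if not m:
        return None
    rest = m.group(1) or ""  # plays the role of %1; empty when nothing follows
    # A substitution ending in a bare "?" produces a URL with no query string.
    return "/automobile-data/?" + rest if rest else "/automobile-data/"


print(target_for("tab=auto_data"))
print(target_for("tab=auto_data&year=2020&make=x"))
print(target_for("year=2020&tab=auto_data"))
```

Because the pattern is anchored with ^, this variant only matches when tab=auto_data is the first parameter, which is why the last call yields None.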

match multiple slashes in url, but not in protocol

I am trying to match multiple slashes at the end of or inside a URL, but not the slashes (two or more) that follow the protocol (http://, https://, ftp://, file:///).
I tried many suggestions from similar SO threads, such as ^(.*)//+(.*)$, [^:](\/{2,}), and ^(.*?)(/{2,})(.*)$, in http://www.regexr.com/, https://regex101.com and http://www.regextester.com/, but none of them worked cleanly for me.
It's pretty weird that I can't find a working example; this goal isn't that rare. Could somebody share a working regex?
Here is a rule that you can use in your site root .htaccess to strip out multiple slashes anywhere from input URLs:
RewriteEngine On
RewriteCond %{THE_REQUEST} //
RewriteRule ^.*$ /$0 [R=301,L,NE]
The THE_REQUEST variable holds the original request line received by Apache from your browser, and it is not overwritten when other rewrite rules run. An example value of this variable is GET /index.php?id=123 HTTP/1.1.
RewriteRule matches against a URL-path in which Apache has already collapsed multiple slashes into a single one, so $0 holds the cleaned path and the redirect to /$0 drops the extra slashes.
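For the standalone regex the question asks about, one possible approach (not taken from the answer above) is a negative lookbehind that refuses to start a match directly after a colon or another slash. A minimal Python sketch, with collapse_slashes as a made-up helper name:

```python
import re

# Two or more slashes, unless the run starts right after ":" or "/".
# The lookbehind protects "://" and ":///" after the protocol.
EXTRA_SLASHES = re.compile(r"(?<![:/])/{2,}")


def collapse_slashes(url):
    """Replace each run of extra slashes with one slash, leaving the protocol intact."""
    return EXTRA_SLASHES.sub("/", url)


print(collapse_slashes("http://example.com//a///b//"))
print(collapse_slashes("file:///tmp//x"))
```

This assumes a regex flavor with lookbehind support; any "scheme://" that appears later in the string, for instance inside a query parameter, is preserved as well.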

What does (|/)$ do with mod_rewrite?

I've seen examples in .htaccess files using mod_rewrite where everything is done through one PHP file and different URLs are rewritten back to index.php.
RewriteRule ^registration(|/)$ /index.php
I'm curious as to what (|/)$ does/is. I've read a lot of stuff and can't seem to find any mention of the use of a vertical bar in mod_rewrite and if I remove this, the redirect still works fine.
The vertical bar is alternation (a logical OR). The group (|/) matches either the empty string or a single slash, so the rule accepts 'registration' with or without a trailing slash.
I prefer using a '?' after the slash, making it optional:
RewriteRule ^registration/?$ /index.php
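A quick way to convince yourself that the two spellings accept the same paths is to run both against a few samples. This sketch uses Python's re module in place of mod_rewrite; the helper name both and the sample paths are illustrative only.

```python
import re

WITH_ALTERNATION = re.compile(r"^registration(|/)$")
WITH_OPTIONAL = re.compile(r"^registration/?$")


def both(path):
    """Return whether each of the two patterns matches the given path."""
    return (bool(WITH_ALTERNATION.match(path)), bool(WITH_OPTIONAL.match(path)))


for path in ["registration", "registration/", "registration//", "registrations"]:
    print(path, both(path))
```

The only practical difference is that (|/) creates a capture group ($1), which the /? form does not.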

Apache RewriteRule for removing trailing slash only at root

I know there are many questions about Apache RewriteRules, especially for removing trailing slashes. I have looked at tons but I can't seem to find anyone trying to solve this problem.
I am using Magento, so the URL structure looks like this:
example.com/index.php
example.com/index.php/
example.com/index.php/page1/
Here is my ideal URL structure:
example.com
example.com/page1/
example.com/page2/
So basically I just want to strip the index.php AND make sure the naked domain does not have a trailing slash (example.com instead of example.com/). Also, I would like to NOT include the hardcoded domain name if possible so that the rewrite can be applied in different environments.
Here is my current Rewrite...
RewriteCond %{REQUEST_URI} ^/index\.php/?
RewriteRule ^index.php/(.*) /$1 [R=301,L]
This seems to work in all situations, except for:
example.com/index.php (doesn't work at all)
example.com/index.php/ (leaves the trailing slash)
I would appreciate any regex advice! Thank you.
UPDATE
Thanks to the answer below from @zx81 I have successfully stripped all URLs down to the root domain, but still can't remove the slash.
So here is the current URL: example.com/
And I can't remove the trailing slash!
Not able to test it live, but try this.
RewriteRule ^index\.php()/?(?:([^/]+)/)? $1$2 [R=301,L]
It should handle the one that doesn't work at all thanks to the empty capturing group 1 ().
In PCRE (Apache's regex flavor) this also strips the trailing slash, but Apache may decide to add it back.