Understanding RegEx - SEO Duplication on last term - regex

I have a problem with duplicate pages for SEO on a website I'm trying to fix. www.example.com/category/c1234 loads just the same as www.example.com/category/c1234garbage
I've been reading online and testing the code, and so far I have narrowed it down to a possible regex problem. I have the following lines:
# url rewrites
RewriteCond %{REQUEST_URI} ^/index\.cfm/.+ [NC]
RewriteRule ^/index.cfm/(([^/]+)/?([^/]+)?)/?(.*)? /index.cfm/$4?$2=$3 [NS,NC,QSA,N,E=SESDONE:true]
I added an R flag to the rule so I could see whether requests pass through it. They do, and after that rule the garbage at the end disappears.
Can someone help me understand this and figure out a way to fix it, so that when you go to www.example.com/category/c1234garbage it redirects to www.example.com/category/c1234?
I've been searching online for quite a while now and thought it might be time to post here since I can't seem to find a solution. I'm reading "Mastering Regular Expressions" but it might take a while for me to find the answers I'm looking for.
I appreciate any help you can give me. Thank you.
EDIT: This is what I have before that:
RewriteEngine On
Rewritebase /
# remove trailing index.cfm
RewriteCond %{QUERY_STRING} ^$
RewriteRule ^index.cfm(\?)?$ / [R=301,L]
# remove trailing slash
RewriteCond %{QUERY_STRING} ^$
RewriteRule (.*)/$ /$1 [R=301,L]
# Remove trailing ?
RewriteCond %{THE_REQUEST} \?\ HTTP [NC]
RewriteRule ^/?(index\.cfm)? /? [R=301,L]
# SEF URLs
SetEnv SEF_REQUEST false
RewriteRule ^[a-z\d\-]+/[a-z]\d+/? /index.cfm/$0 [NC,PT,QSA,E=SEF_REQUEST:true]
RequestHeader add SEF-Request %{SEF_REQUEST}e
RewriteCond %{HTTP:SEF_REQUES} ^true$ [NC]
RewriteRule . - [L]
EDIT: I was reading the htaccess again and found this that I don't understand but it might have some connection. It's located at the bottom of the file.
# lowercase the hostname, and set the TLD name to an enviroment variable
RewriteCond ${lowercase:%{SERVER_NAME}|NONE} ^(.+)$
RewriteCond %1 ^[a-z0-9.-]*?[.]{0,1}([a-z0-9-]*?\.[a-z.]{2,6})$
RewriteRule .? - [E=TLDName:%1]

From your description and your code, it sounds like this is the transformation that's happening here:
www.example.com/category/c1234garbage
↓
www.example.com/index.cfm?category=c1234garbage
So the problem, I think, is not your rewriting rules. The problem is how you're handling querystring parameters on the server side. If you have an actual page called index.cfm that's interpreting those parameters, you should tweak the code behind that page to validate them and redirect to /category/c1234 where appropriate.
I think the code in index.cfm is looking at the parameter, checking to see if it starts with something recognizable, and going from there. You need to make it more strict.
Alternatively, you could add another .htaccess rule to parse the c1234garbage part and decide which part is valid, and which part (if any) is garbage. I can't give you a regex for that, though, since I don't know the rules for a valid input in your application.
Edit:
I think I found the problem. This part here:
RewriteRule ^[a-z\d\-]+/[a-z]\d+/? /index.cfm/$0 [NC,PT,QSA,E=SEF_REQUEST:true]
You specify the beginning of the relative URL with ^, but you don't specify that you want it to match all the way to the end. So I think what's happening is that it's taking the part of the string that matches, throwing out everything else, and appending it to /index.cfm/. So it takes only the /category/c1234 part from /category/c1234garbage, because that's the part that matches ^[a-z\d\-]+/[a-z]\d+/?.
You can probably fix this with just a word break:
RewriteRule ^[a-z\d\-]+/[a-z]\d+\b/? /index.cfm/$0 [NC,PT,QSA,E=SEF_REQUEST:true]
If that doesn't work, I'm afraid we've reached the end of my htaccess knowledge. I'm more of a regex guy.
Just BTW, this still seems a little awkward. If I understand this right, part of the URL will still get thrown out if it doesn't fit your exact pattern. E.g. /category/c1234?abc=123 will lose its querystring parameters. You might want to redesign how your rules are set up.
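To see why the unanchored pattern lets the garbage through, here is a small sketch that runs both patterns from this answer against a few paths using Python's re module instead of Apache. It assumes Python's regex behavior is close enough to mod_rewrite's PCRE for these simple patterns; the helper name matched_prefix and the sample paths are made up for illustration.

```python
import re

# Patterns copied from the rules above; re.IGNORECASE stands in for the [NC] flag.
UNANCHORED = re.compile(r"^[a-z\d\-]+/[a-z]\d+/?", re.IGNORECASE)
WORD_BREAK = re.compile(r"^[a-z\d\-]+/[a-z]\d+\b/?", re.IGNORECASE)


def matched_prefix(pattern, path):
    """Return the text the pattern matches (what $0 would hold), or None."""
    m = pattern.match(path)
    return m.group(0) if m else None


# Hypothetical sample paths, written without the leading slash as in .htaccess.
for path in ("category/c1234", "category/c1234garbage", "category/c1234-garbage"):
    print(path, "|", matched_prefix(UNANCHORED, path), "|", matched_prefix(WORD_BREAK, path))
```

Note that a hyphen still counts as a word boundary, so c1234-garbage is accepted by the \b version too. Ending the pattern with $ would be stricter, if that fits the application.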

I partially solved the problem. I added
# Remove garbage from after category
RewriteCond %{REQUEST_URI} [a-z\d\-]+/[a-z]\d+(.+)
RewriteRule ^([a-z\d\-]+/[a-z]\d+)/? $1 [R=301]
on top of the SEF rules. It's doing what I want, which is to remove the garbage from the URL, but it gives me an infinite loop because it's redirecting even when the URL is clean. Any hints?
EDIT: So I realized that the .+ at the end is matching the numbers as well... How do I change it to match anything other than numbers after the numbers? Basically, where I have the .+ I need a "match any character except for numbers".
EDIT: I finally got it to work with the following code:
# Remove garbage from after category
RewriteCond %{REQUEST_URI} [a-z\d\-]+/[a-z]\d+[A-Za-z-.]+
RewriteRule ^([a-z\d\-]+/[a-z]\d+)/? $1 [R=301]
The (.+) I was using previously was taking the last digit of the number (in c1234) as its own match, so the condition would always pass as true unless the URL was something like c1, with only a single digit.
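The backtracking behind that loop can be reproduced outside Apache. This is only a sketch using Python's re module, on the assumption that it treats these patterns the same way mod_rewrite does; the character class in the second pattern is the same as in the working rule, with the hyphen moved to the end to keep it unambiguous.

```python
import re

# First attempt: (.+) only needs one character, so \d+ gives up its last digit.
LOOPING = re.compile(r"[a-z\d\-]+/[a-z]\d+(.+)")
# Working version: the trailing part must be letters, dots or hyphens.
FIXED = re.compile(r"[a-z\d\-]+/[a-z]\d+[A-Za-z.-]+")

clean = "/category/c1234"
dirty = "/category/c1234garbage"

# On a clean URL the first pattern still matches, which keeps the redirect firing.
print(LOOPING.search(clean).group(1))
# The fixed pattern only matches when something follows the digits.
print(FIXED.search(clean) is not None)
print(FIXED.search(dirty) is not None)
```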

Related

How to write expression to grab all after an expression and then rewrite in htaccess

I'm new to the rewriting of urls and regex in general. I'm trying to rewrite a URL to make it a 'pretty url'
The original URL was
/localhost/house/category.php?cat=lounge&page=1
I want the new url to look like this:
/localhost/house/category?lounge&page=1
(like I say, I'm new so not trying to take it too far at the moment)
the closest I've managed to get it to is this:
RewriteRule ^category/(.*)$ ./category.php?cat=$1 [NC,L]
but that copies the whole URL and creates:
/localhost/house/category/house/category/lounge&page=1
I'm sure, there must be an easy way to say copy all after that expression, but I haven't managed to get there yet.
I will try to help you:
You probably have already, but try a mod rewrite generator and htaccess tester.
From this answer: The query (everything after the ?) is not part of the URL path and cannot be passed through or processed by RewriteRule directive without using [QSA].
I propose using RewriteCond and using %1 instead of $1 for query string matches as opposed to doing it all in RewriteRule.
For your solution, try:
RewriteCond %{QUERY_STRING} ^(.*)$
RewriteRule ^house/category$ house/category.php?cat=%1 [NC,L]
This will insert the .php and cat= while retaining the &page=
Anticipating your next step, the below mod rewrite may help get started in converting
http://localhost/house/category/lounge/1
to
http://localhost/house/category.php?cat=lounge&page=1
Only RewriteRule necessary here, no query string:
RewriteRule ^house/category/([^/]*)/([0-9]*)/?$ house/category.php?cat=$1&page=$2 [NC,L]
Use regex101 for more help and detailed description on what these regexes do.
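If you want to check the capture groups before touching the server, the pattern from the last rule can be tried in a few lines of Python. This is just a sketch: the rewrite function is a made-up stand-in for what RewriteRule does with $1 and $2, and re.IGNORECASE stands in for [NC].

```python
import re

# Pattern copied from the RewriteRule above.
RULE = re.compile(r"^house/category/([^/]*)/([0-9]*)/?$", re.IGNORECASE)


def rewrite(path):
    """Mimic the substitution: $1 becomes cat, $2 becomes page."""
    m = RULE.match(path)
    if m is None:
        return path  # rule does not apply, path is left alone
    return "house/category.php?cat={}&page={}".format(m.group(1), m.group(2))


print(rewrite("house/category/lounge/1"))
print(rewrite("house/other"))
```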
If it is still not working, continue to make the regex more lenient until it matches correctly:
Try to remove the ^ in RewriteRule so it becomes
RewriteRule category$ category.php?cat=%1 [NC,L]
Then it will match that page at any directory level. Then add back in house/ and add /? wherever an optional leading/trailing slash may cause a problem, etc.
Thanks for all your suggestions, I took it back to this
RewriteRule category/([^/]*)/([0-9]*)/?$ category.php?cat=$1&page=$2 [NC,L]
which has done the trick, and I'll leave it at this for now.

.htaccess redirect is sending whole string, instead of partial

I partially have my .htaccess rule working. What I have currently is:
#tag to search redirect
RewriteCond %{REQUEST_URI} ^/tag\/*
RewriteRule ^(.*) https://www.testurl.co.uk/search-results?hsf=$1&id=12 [R=301,L]
What is currently happening is that the whole of tag/* ends up where the $1 is.
For example, if the request is tag/test, the URL generated is
https://www.testurl.co.uk/search-results?hsf=tag/test&id=12
when it should ideally be:
https://www.testurl.co.uk/search-results?hsf=test&id=12
Any help would be greatly appreciated.
Many thanks
You can use this rule:
RewriteRule ^tag/(.+)$ https://www.testurl.co.uk/search-results?hsf=$1&id=12 [R=301,L,QSA]
Pattern ^tag/(.+)$ will capture any value after /tag/ into group #1 and that is being used in $1.
Make sure to clear your browser cache before testing this.
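As a quick illustration of the difference between the two patterns, here is a sketch in Python rather than Apache. It assumes the path reaches the rule as tag/test, without the leading slash, as it does in a .htaccess file.

```python
import re

path = "tag/test"

# Original rule: ^(.*) puts the whole path into group 1.
whole = re.match(r"^(.*)", path).group(1)
# Suggested rule: tag/ sits outside the parentheses, so only the rest is captured.
tail = re.match(r"^tag/(.+)$", path).group(1)

print("hsf=" + whole)
print("hsf=" + tail)
```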

.htaccess replace "/" with "_"

This is my current htaccess
<IfModule mod_rewrite.c>
# turn on rewrite engine
RewriteEngine on
# if request is a directory, make sure it ends with a slash
RewriteCond %{REQUEST_FILENAME} -d
RewriteRule ^(.*/[^/]+)$ $1/
# if not rewritten before, AND requested file is wikka.php
# turn request into a query for a default (unspecified) page
RewriteCond %{QUERY_STRING} !wakka=
RewriteCond %{REQUEST_FILENAME} wikka.php
RewriteRule ^(.*)$ wikka.php?wakka= [QSA,L]
# if not rewritten before, AND requested file is a page name
# turn request into a query for that page name for wikka.php
RewriteRule ^(.*)$ wikka.php?wakka=$1 [QSA,L]
</IfModule>
My current url structure is
www.domain.com/site/pool/Page_Example_Test
www.domain.com/site/pool/Page_Example_Test/edit
www.domain.com/site/pool/Page_Example_Test/edit?id=1
www.domain.com/site/pool/Page_Example_Test/history
www.domain.com/site/pool/Page_Number_Room
www.domain.com/site/pool/Page_Number_Room/edit
How is it possible to access them like this
www.domain.com/site/pool/Page/Example/Test
www.domain.com/site/pool/Page/Example/Test/edit
www.domain.com/site/pool/Page/Example/Test/edit?id=1
www.domain.com/site/pool/Page/Example/Test/history
www.domain.com/site/pool/Page/Number/Room
www.domain.com/site/pool/Page/Number/Room/edit
with the htaccess changing only those "/" to "_"?
The only suffixes that can end the URL are /history and /edit, nothing more. The normal case is without /edit or /history.
If there are between one and five pieces in the page name, such as
Page
Page/Page
Page/Page/Page
Page/Page/Page/Page
Page/Page/Page/Page/Page
and every element starts with a capital letter and every other part of the url is only lowercase, then you can accomplish this with four replacements (the one-piece-only one needs no replacement). For example, for three pieces:
Find what: ([A-Z][a-z]+)\/([A-Z][a-z]+)\/([A-Z][a-z]+)(\/)?
Replace with: $1_$2_$3$4
You can try it here (although I don't understand why it's only replacing with spaces in every line but the last).
Notes:
Each part (...) is captured
The final slash \/? is optional, and is the slash before the potential "edit" or "history".
In some regex flavors, you don't need to escape the slash /, but it's safer to do so: \/.
Each $[number] is a capture group reference
WARNING! These replacements must be done from longest to shortest: five pieces, then four, then three, then two. Otherwise, you'll seriously mess things up.
All the links in this answer come from the Stack Overflow Regular Expressions FAQ. Please consider bookmarking it for future reference. In particular, see the list of online regex testers in the bottom section, so you can try things out yourself.
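The longest-to-shortest warning is easier to see with the replacements scripted. The following is only a sketch in Python, not an .htaccess rule; the helper names replace_n and join_pieces are made up, and the patterns are built the same way as the three-piece example above (the slash is left unescaped because Python does not need it).

```python
import re

PIECE = r"([A-Z][a-z]+)"  # one capitalized piece of the page name


def replace_n(url, n):
    """Join exactly n capitalized pieces with underscores, keeping a trailing slash."""
    pattern = "/".join([PIECE] * n) + r"(/)?"
    replacement = "_".join("\\{}".format(i) for i in range(1, n + 1))
    replacement += "\\{}".format(n + 1)  # the optional trailing slash
    return re.sub(pattern, replacement, url)


def join_pieces(url):
    """Apply the replacements from five pieces down to two."""
    for n in (5, 4, 3, 2):
        url = replace_n(url, n)
    return url


print(join_pieces("www.domain.com/site/pool/Page/Example/Test/edit"))
# Running the two-piece replacement first leaves the third piece stranded:
print(replace_n("Page/Example/Test", 2))
```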
Place this code in /site/pool/.htaccess:
RewriteEngine On
RewriteBase /site/pool/
RewriteRule "^(Page)/([^/]+)/([^/]+/.*)$" /$1/$2_$3 [L]
RewriteRule "^(Page)/([^/]+)/([^/]+)$" /$1/$2_$3 [L]

ExpressionEngine RewriteRule RegEx Throws 500 Error

When using categories in ExpressionEngine, a Category URL Indicator trigger word can be set to load a category by its {category_url_title}.
I would like to remove the category "trigger word" from the URL.
Here is what I have so far, with the trigger word set to "category":
RewriteRule /products/(.+)$ /products/category/$1 [QSA,L]
I'm not an expert at writing regular expressions, but I do a little. I'm 99% sure my RegEx is fine, however when trying to use it as a RewriteRule in my .htaccess file, I'm getting a 500 error.
I'm sure it's something stupid, but for some reason I'm not seeing my mistake. What am I doing wrong?
Update: Adding a ^ to the beginning of the RewriteRule fixed the 500 error.
RewriteRule ^/products/(.+)$ /products/category/$1 [QSA,L]
This is not safe. Take:
/products/a
The regex group matches a.
It will be rewritten to:
/products/category/a
which the regex matches again (this time, the group matches category/a). Guess what will happen.
You want /products/ from the beginning of input if it is not followed by category/, which means you want a negative lookahead. Also, the QSA flag is of no use, you don't have a query string to rewrite (QSA stands for Query String Append):
RewriteRule ^/products/(?!category/)(.+) /products/category/$1 [L]
Another way to use it (and which I personally prefer) is to use a RewriteCond prior to the rule:
RewriteCond %{REQUEST_URI} ^/products/(?!category/)
RewriteRule ^/products/(.*) /products/category/$1 [L]
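To make the "guess what will happen" concrete, here is a rough simulation in Python. It is not how Apache processes rules internally; the rewrite helper and its max_passes cap are invented for the sketch, and it simply reapplies one pattern until the path stops changing.

```python
import re

NAIVE = r"^/products/(.+)$"
GUARDED = r"^/products/(?!category/)(.+)"  # negative lookahead from the rule above


def rewrite(path, pattern, max_passes=3):
    """Reapply the rule until nothing changes or max_passes is reached."""
    for _ in range(max_passes):
        new_path = re.sub(pattern, r"/products/category/\1", path, count=1)
        if new_path == path:
            break  # the rule no longer matches, so rewriting stops
        path = new_path
    return path


# The naive pattern matches its own output and keeps growing on every pass.
print(rewrite("/products/a", NAIVE))
# The lookahead version rewrites once, then leaves the result alone.
print(rewrite("/products/a", GUARDED))
```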
This Apache RewriteRule should do the job for you*:
RewriteCond %{REQUEST_URI} ^/products/(?!category/)
RewriteRule ^/products/(.*) /products/category/$1 [L]
With this in place, you'll need to hard code your category links manually:
{categories backspace="2"}
{category_name},
{/categories}
Which would output the new Category URLs you desire:
http://example.com/products/toys
Otherwise, if using the recommended path variable when building your category links:
{categories backspace="2"}
{category_name},
{/categories}
Would create links with the Category URL Indicator in the URI:
http://example.com/products/C1
http://example.com/products/category/toys
Which — while perfectly valid — would create canonicalization issues on your site since the different URLs would appear as duplicate content to search engines.
*Credit to fge for the brilliant mod_rewrite rule.

Can mod_rewrite preserve a double slash?

I'm just learning mod_rewrite and regex stuff, and what I'm trying to do is pass variables of any name, with any number of variables and values, into a script and have them forwarded to a different script.
here is what I have so far:
RewriteEngine on
RewriteRule ^script\$(.*[\])? anotherscript?ip=%{REMOTE_ADDR}&$1 [L]
That all seems to work except that one of the parameters I'm passing is a URL and the // after http:// always gets stripped down to one slash.
for example, I do
script$url=http://www.stackoverflow.com
then it redirects to:
anotherscript?ip=127.0.0.1&url=http:/www.stackoverflow.com
and the second script chokes on the single-slash.
I realize that preserving a double-slash is the exact opposite of what people usually do with mod_rewrite. Is there a way I can preserve the double-slash?
EDIT: Solution found with Gumbo's help.
RewriteCond %{THE_REQUEST} ^GET\ (.*)/script\$([^\s]+)
RewriteRule ^script\$(.*) anotherscript?ip=%{REMOTE_ADDR}&%2 [L]
I had to add that (.*) in front of /script on the RewriteCond, once I did that it got rid of the 404 errors and then it was just a matter of passing the matches through.
Try this rule:
RewriteCond %{THE_REQUEST} ^GET\ /script\$([^\s]+)
RewriteRule ^script\$.+ anotherscript?ip=%{REMOTE_ADDR}&%1 [L]
See Diggbar modrewrite- How do they pass URLs through modrewrite? for the explanation.
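The idea behind matching on THE_REQUEST is that it holds the raw request line, before the path is cleaned up. As a sketch only, the condition's pattern can be tried in Python against a made-up request line (the HTTP/1.1 suffix and the 127.0.0.1 address are assumptions for the example; \S is the same as [^\s]):

```python
import re

# Hypothetical raw request line, as %{THE_REQUEST} would contain it.
REQUEST_LINE = "GET /script$url=http://www.stackoverflow.com HTTP/1.1"

# Same pattern as the RewriteCond above.
m = re.match(r"^GET /script\$(\S+)", REQUEST_LINE)
params = m.group(1)  # plays the role of %1 in the RewriteRule

target = "anotherscript?ip=127.0.0.1&" + params
print(target)
```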
I think there may be something wrong with the first part of your RewriteRule regex:
^script\$(.*[\])?
The backslash ( \ ) is used to escape a special character into a literal one, thus you are actually trying to match a closing bracket ( ] ). Is that intended?
Try this:
RewriteRule ^script\$(.*)? anotherscript?ip=%{REMOTE_ADDR}&$1 [L]