URLs rewrite - Removing unwanted characters and slashes - regex

I need to rewrite URLs using .htaccess to redirect all users from the old sitemap to the new URLs. The old URLs look like this:
http://example.com/a/v/c/document_title1.php
http://example.com/fr/x/1/o/document_title2.php
http://example.com/de/a/a/2/document_title3.php
http://example.com/eng/6/z/z/document_title4.php
I need to keep the first directory and then remove all sub-directories, including the slashes and the characters (letters and numbers) between them, so the new URLs will look like this:
http://example.com/document_title1.php
http://example.com/fr/document_title2.php
http://example.com/de/document_title3.php
http://example.com/eng/document_title4.php
I tried various online generators and always got a 500 error. Is this something I can do?

You can use this substitution
Pattern: (?<=http://example\.com/)(fr/|de/|eng/)?.*(?=document_title\d\.php)
Replace with: $1
This will almost certainly need some tweaks, but I'm working with what I was given. In the (fr/|de/|eng/)? group you'll need to add an alternative, separated by |, for each possible language directory.
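To check the substitution outside Apache, here is a minimal Python sketch that applies the same pattern to the four sample URLs from the question. The function name flatten is only illustrative, and Python writes the $1 back-reference as \1; this says nothing about how the pattern would be wired into .htaccess.

```python
import re

# Pattern from the answer above. The lookbehind is fixed-width, which Python requires.
PATTERN = re.compile(
    r"(?<=http://example\.com/)(fr/|de/|eng/)?.*(?=document_title\d\.php)"
)


def flatten(url: str) -> str:
    """Keep an optional language directory and drop the other sub-directories."""
    # r"\1" is Python's spelling of "$1"; an unmatched group becomes an empty string.
    return PATTERN.sub(r"\1", url)


old_urls = [
    "http://example.com/a/v/c/document_title1.php",
    "http://example.com/fr/x/1/o/document_title2.php",
    "http://example.com/de/a/a/2/document_title3.php",
    "http://example.com/eng/6/z/z/document_title4.php",
]

for url in old_urls:
    print(url, "->", flatten(url))
```

Note that \d matches a single digit only, so a name such as document_title10.php would need \d+ instead.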

Related

.htaccess RedirectMatch conditional regex fails

Good evening dear fellow coders,
I am trying to handle urls without file extensions that are more readable to average internet users using a .htaccess redirect, like http://example.com/file to http://example.com/file.php (with or without query)
Unfortunately I am not able to use mod_rewrite, but although redirect does work, it seems not to be able to handle my request properly.
To handle any given URL I tried using
RedirectMatch ^/(?(?=.*\.php(?i).*)|(\w+)(.*)) /$1.php$2
And
RedirectMatch ^/.*\.php(?i).*|(\w+)(.*) /$1.php$2
I also tried using $2 and $3, on the assumption that, contrary to everything I know, the first pattern might be captured as well.
It should extract characters and numbers for $1 and everything else for $2 (starting with ? for queries etc.) unless it contains the file extension .php.
Validating the regex at https://regex101.com/r/zF2bV9/2 suggests everything should work fine, but when I add either of these lines to the .htaccess, any requested file name is replaced with ".php" (as in http://example.com/.php), which obviously produces an error for a non-existent file.
What am I missing about the code or the redirect functionality?
You can try something like:
RedirectMatch ^/([^.]+)$ /$1.php
This matches the URL provided there is not already a dot (i.e. .php) in the URL, and so prevents a redirect loop.
As mentioned in comments, you don't need to do anything specific with the query string, provided you want it passed through to the destination unaltered. The query string is not present in the URL-path that the RedirectMatch directive matches against anyway. So, any manipulation of the query string would require mod_rewrite.
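As a rough illustration of why the dot exclusion prevents the loop, the following Python sketch applies the same pattern to a few URL-paths. The helper name redirect_target is hypothetical, and the sketch only models the pattern matching, not Apache's redirect handling.

```python
import re
from typing import Optional

# Same pattern as the RedirectMatch directive above.
PATTERN = re.compile(r"^/([^.]+)$")


def redirect_target(url_path: str) -> Optional[str]:
    """Return the redirect target for a URL-path, or None if the rule does not apply."""
    match = PATTERN.match(url_path)
    if match is None:
        return None
    return "/" + match.group(1) + ".php"


for path in ["/file", "/dir/file", "/file.php", "/"]:
    print(path, "->", redirect_target(path))
```

Because the target always contains a dot, the redirected request no longer matches the pattern, so the rule fires at most once per request.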

Regex to match specific url with or without extensions but not including subdomains, and to match relative paths

I need a regex that matches a slew of URLs as well as relative paths. They could vary a great deal in their exact format, but will basically look like the following:
http://example.com/asbd
http://example.com/products/asdb
http://example.com/
/Products/product.aspx
/Products/random/articles.aspx
/Products/product
/Products/random/articles
So far I have the following:
http://(?:www\.| )*example\.com(?:/|)[A-Za-z0-9](.*)$
which matches
http://example.com/asbd
It is very important the regex does not match subdomain urls like the following:
http://careers.example.com/
http://investor.example.com/asdf
http://newsroom.example.com/
The following makes sure it's the main domain OR a relative link:
^(?:http:\/\/(?:www\.)?example\.com)?(?:\/.*)$
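A quick way to sanity-check this pattern against the sample links from the question is a small Python sketch like the one below. The function name is_main_or_relative is an assumption made for illustration, and the forward slashes are left unescaped because Python does not need the escapes.

```python
import re

# Same pattern as above, without the optional backslashes before "/".
PATTERN = re.compile(r"^(?:http://(?:www\.)?example\.com)?(?:/.*)$")


def is_main_or_relative(link: str) -> bool:
    """True for links on the main domain or relative paths, False for subdomains."""
    return PATTERN.match(link) is not None


should_match = [
    "http://example.com/asbd",
    "http://example.com/products/asdb",
    "http://example.com/",
    "/Products/product.aspx",
    "/Products/random/articles",
]
should_not_match = [
    "http://careers.example.com/",
    "http://investor.example.com/asdf",
    "http://newsroom.example.com/",
]

for link in should_match + should_not_match:
    print(link, "->", is_main_or_relative(link))
```

One thing to be aware of: the pattern requires a slash after the host, so a bare http://example.com without a trailing slash does not match.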

matching only numbers in a regex string for redirect

I am using a redirection plugin for WordPress and have no experience with regex.
I have a URL that can have anything after the domain, but I only want to redirect if the path consists of numbers and nothing else, so that of the following URLs only the last one would match:
http://j.net/contact
http://j.net/c4t
http://j.net/4con
http://j.net/4co5
http://j.net/anything/123 * this should fail
http://j.net/456 * this should pass
I came up with this:
(\d+)$
to:
article/$1
But I ended up in an infinite loop.
Edit: the loop seems to come into play when navigating to:
http://j.net/1289
Or:
http://j.net/dribble/1289
Your solution seems to work fine, after escaping the / character
See the example: http://regex101.com/r/cX4bV6/2
PS. I'm not sure what language you are using and whether WordPress would support it.
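One possible explanation for the loop, sketched in Python below, is that (\d+)$ is not anchored at the start: it matches any path that merely ends in digits, including the redirect target article/1289 itself. The sketch assumes the plugin matches the regex against a path beginning with /, which is an assumption and not something confirmed in the question; the anchored variant ^/(\d+)$ is offered only as an illustration.

```python
import re

UNANCHORED = re.compile(r"(\d+)$")   # pattern from the question
ANCHORED = re.compile(r"^/(\d+)$")   # hypothetical variant: digits only, one level deep


def matches(pattern: "re.Pattern[str]", path: str) -> bool:
    """True if the pattern is found anywhere in the path."""
    return pattern.search(path) is not None


paths = [
    "/contact",
    "/c4t",
    "/4con",
    "/4co5",
    "/anything/123",
    "/456",
    "/article/456",  # the redirect target for /456
]

for path in paths:
    print(path, matches(UNANCHORED, path), matches(ANCHORED, path))
```

With the unanchored pattern, /4co5, /anything/123 and /article/456 all match as well as /456; with the anchored one, only /456 does.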

More efficient RewriteRule for messy URL

I have developed a new web site to replace an existing one for a client. Their previous site had some pretty nasty looking URLs to their products. For example, an old URL:
http://mydomain.com/p/-3-0-Some-Ugly-Product-Info-With-1-3pt-/a-arbitrary-folder/-18pt/-1-8pt-/ABC1234
I want to catch all requests to the new site that use these old URLs. The information I need out of the old URL is the ABC1234 which is the product ID. To clarify, the old URL begins with /p/ followed by four levels of folders, then the product ID.
So for example, the above URL needs to get rewritten to this:
http://mydomain.com/shop/?sku=ABC1234
I'm using Apache 2.2 on Linux. Can anyone point me to the correct pattern to match? I know this is wrong, but here is where I am currently at:
RewriteRule ^p/([A-Za-z0-9-]+)/([A-Za-z0-9-]+)/([A-Za-z0-9-]+)/([A-Za-z0-9-]+)/([A-Za-z0-9-]+)?$ shop/?sku=$5 [R=301,NC,L]
I'm pretty sure that the pattern used to match each of the 4 folders is redundant, but I'm just not that sharp with regex. I've tried some online regex evaluators with no success.
Thank you.
--EDIT #1--
Actually, my RewriteRule above does work, but is there a way to shorten it up?
--EDIT #2--
Thanks to ddr, I've been able to get this expression down to this:
RewriteRule ^p/([\w-]+/){4}([\w-]+)$ shop/?_sku=$2 [R=301,NC,L]
--EDIT #3--
Mostly for the benefit of ddr, but I welcome anyone to assist who can. I'm dealing with over 10,000 URLs that need to be rewritten to work with a new site. The information I've provided so far still stands, but now that I am testing that all of the old URLs are being rewritten properly I am running into a few anomalies that don't work with the RewriteRule example provided by ddr.
The old URLs are consistent in that the product ID I need is at the very end of the URL as documented above. The first folder is always /p/. The problem I am running into at this point is that some of the URLs have a URL-encoded double quote (%22). Additionally, some of the URLs contain a /-/ as one of the four folders mentioned. So here are some examples of the variations in the old URLs:
/p/-letters-numbers-hyphens-88/another-folder/-and-another-/another-18/ABC1234
/p/-letters-numbers-hyphens-88/2%22/-/-/ABCD1234
/p/letters-numbers-hyphens-1234/34-88/-22/-/ABCD1234/
Though the old URLs are nasty, I think it is safe to say that the following are always true:
Each begins with /p/
Each ends with the product ID that I need to isolate.
There are always four levels of folders between /p/ and the product ID.
Some folders in between have hyphens, some don't.
Some folders in between are a hyphen ONLY.
Some folders in between contain a % character where they are URL encoded.
Some requests include a / at the very end and some do not.
The following rule was provided by ddr and worked great until I ran into the URLs that contain a % percent sign or a folder with only a hyphen:
RewriteRule ^p/(?:[\w-]+/){4}([\w-]+)$ shop/?_sku=$1 [R=301,NC,L]
Given the rule above, how do I edit it to allow for a folder that is hyphen only (/-/) or for a percent sign?
You can use character classes to reduce some of the length. The parentheses (capture groups) are also unnecessary, except the last one, as @jpmc26 says.
I'm not especially familiar with Apache rules, but try this:
RewriteRule ^p/(?:[\w-]+/){4}([\w-]+)$ shop/?sku=$1 [R=301,NC,L]
It should work if extended regular expressions are supported.
\w is equivalent to [A-Za-z0-9_], and it does no harm to match underscores as well, so that's one replacement.
The {4} matches exactly four repetitions of the previous group. This is not always supported so Apache may not like it.
The ?: is optional but indicates that these parens should not be treated as a capture. Makes it slightly more efficient.
I'm not sure what the part in [] at the end is for but I left it. I can't see why you'd need a ? before the $, so I took it out.
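To see that the shortened pattern extracts the same product ID as the original five-group one, here is a small Python sketch run against the sample URL from the question. The leading slash is omitted because a RewriteRule in .htaccess is matched without it, re.IGNORECASE stands in for the NC flag, and the function name extract_sku is hypothetical.

```python
import re
from typing import Optional

# Original rule from the question: five capture groups, product ID in group 5.
LONG = re.compile(
    r"^p/([A-Za-z0-9-]+)/([A-Za-z0-9-]+)/([A-Za-z0-9-]+)/([A-Za-z0-9-]+)/([A-Za-z0-9-]+)?$",
    re.IGNORECASE,
)
# Shortened rule: one non-capturing group repeated four times, product ID in group 1.
COMPACT = re.compile(r"^p/(?:[\w-]+/){4}([\w-]+)$", re.IGNORECASE)


def extract_sku(pattern: "re.Pattern[str]", path: str, group: int) -> Optional[str]:
    """Return the requested capture group, or None if the path does not match."""
    match = pattern.match(path)
    return match.group(group) if match else None


path = "p/-3-0-Some-Ugly-Product-Info-With-1-3pt-/a-arbitrary-folder/-18pt/-1-8pt-/ABC1234"

print(extract_sku(LONG, path, 5))     # ABC1234
print(extract_sku(COMPACT, path, 1))  # ABC1234
```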
Edit: an equivalent way to write it, if Apache likes it, would be the following (only the last segment is captured, so the back-reference stays $1):
RewriteRule ^p(?:/[\w-]+){4}/([\w-]+)$ shop/?sku=$1 [R=301,NC,L]
EDIT: response to edit 3 of the question.
I'm surprised it doesn't work with only -. The [\w-]+ should match even where there is just a single -. Are you sure there isn't something else going on in these URLs?
You might also try replacing - in the regex with \-.
As for the %, just change [\w-] to [\w%-]. Make sure you leave the - at the end! Otherwise the regex engine will try to interpret it as part of a character range.
EDIT 2: Or try this:
RewriteRule ^p/(?:.*?/){4}(.*?)/?$ shop/?sku=$1 [R=301,NC,L]
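The three variants discussed in this answer can be compared against the sample paths from the question with a Python sketch like the one below. It treats the paths as literal strings; Apache's documentation describes the RewriteRule pattern as matched against the %-decoded URL-path, so in practice %22 may arrive as a literal double quote, which is why a fourth, hypothetical path with a raw " is included. The names STRICT, WITH_PERCENT, LOOSE and sku are illustrative only.

```python
import re
from typing import Optional

FLAGS = re.IGNORECASE  # stands in for the NC flag

STRICT = re.compile(r"^p/(?:[\w-]+/){4}([\w-]+)$", FLAGS)
WITH_PERCENT = re.compile(r"^p/(?:[\w%-]+/){4}([\w%-]+)$", FLAGS)
LOOSE = re.compile(r"^p/(?:.*?/){4}(.*?)/?$", FLAGS)


def sku(pattern: "re.Pattern[str]", path: str) -> Optional[str]:
    """Return the captured product ID, or None if the path does not match."""
    match = pattern.match(path)
    return match.group(1) if match else None


paths = [
    "p/-letters-numbers-hyphens-88/another-folder/-and-another-/another-18/ABC1234",
    "p/-letters-numbers-hyphens-88/2%22/-/-/ABCD1234",
    "p/letters-numbers-hyphens-1234/34-88/-22/-/ABCD1234/",
    'p/-letters-numbers-hyphens-88/2"/-/-/ABCD1234',  # hypothetical decoded form
]

for path in paths:
    print(path)
    print("  strict:", sku(STRICT, path))
    print("  with %:", sku(WITH_PERCENT, path))
    print("  loose: ", sku(LOOSE, path))
```

In this sketch the hyphen-only folders are accepted by all three patterns; it is the percent sign, the literal quote and the trailing slash that the stricter patterns reject, and only the loose pattern returns a product ID for all four paths.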

.htaccess Negated Lookahead Regex malformed

The following is a segment from my .htaccess file.
I want the following behaviour from Apache (currently the site is at localhost, but that shouldn't matter, right?):
If the requested resource is anything else other than
{site_url}/core
or
{site_url}/login
like
{site_url}/pseudo/path/name
the resource served must be
{site_url}/site/pseudo/path/name
Otherwise the URL served must be {site_url}/core or {site_url}/login, i.e. whatever was requested.
The .htaccess file is:
<IfModule mod_alias.c>
AliasMatch ^/(?!core|login)(/?.*)$ /site/$2
Header add X-Enabled mod_alias
</IfModule>
But this doesn't seem to be working and returns an error. I am not very familiar with regular expressions and am trying to learn them. So what I have inferred from this expression is:
If the expression after '/', i.e. the URI after site_url, does not match core or login, (?!core|login), and is followed by anything, including a sub-folder, (/?.*)$ (an optional slash and anything following it), set the alias to /site/(anything that was matched in the second parentheses).
The module is working, which I've checked using only the Header add part, the problem is the regex.
Please help.
Leave off the ^ at the beginning. The regex you have would match /pseudo/path/name but not {site_url}/pseudo/path/name, because you're telling it that the text must begin with a /.
Also, be careful, because your regex is excluding things like {site_url}/corel. That's probably not going to be a problem unless you have other directories beginning with core, but if you really want to make it match anything other than {site_url}/core or {site_url}/login, use this regex:
/(?!core$|login$)(/?.*)$
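The difference between the two lookaheads can be seen in a short Python sketch. It only models the regex matching on URL-paths, not how Apache applies AliasMatch, and the helper name alias_target is hypothetical. Each pattern has a single capturing group (a lookahead does not capture), so the sketch builds the target from group 1.

```python
import re
from typing import Optional

PREFIX = re.compile(r"^/(?!core|login)(/?.*)$")  # lookahead from the question
EXACT = re.compile(r"/(?!core$|login$)(/?.*)$")  # lookahead from this answer


def alias_target(pattern: "re.Pattern[str]", path: str) -> Optional[str]:
    """Return /site/<captured part> if the pattern matches, otherwise None."""
    match = pattern.search(path)
    return "/site/" + match.group(1) if match else None


for path in ["/core", "/login", "/corel", "/core/x", "/pseudo/path/name"]:
    print(path, "| prefix:", alias_target(PREFIX, path), "| exact:", alias_target(EXACT, path))
```

In this sketch the prefix form skips /corel and /core/x along with /core, whereas the exact form skips only /core and /login themselves, so /core/x would be mapped under /site/ by the exact form.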