regex rewrite url cluster - regex

I've been trying to learn regex and it's terribly complicated. I'm not even sure it's possible to rewrite these URLs without doing them individually. I can do them individually (search and replace), but there are a few different clusters and thousands of URLs (it's a migration).
This is a Joomla site running acesef software. Here is an example URL from one particular cluster. The end of the URL is identical for the old and new URLs. Only the beginning directories have changed. So is there a way to match the end of the URL for all URLs in those particular directories and rewrite them from old to new with a single expression?
Old URL = www.domain.com/property-details/condominiums/3448-page-title
New URL = www.domain.com/bangkok/condos/rent/3448-page-title
I won't even bother posting what I've tried to write so far, because it's so far off. I'm trying to get my feet wet with regex, but this is a pretty complicated rewrite for a beginner.

Well uh, at face value you could just use this:
[^/]+$
This will give you anything after the last / so in your example, you'd get 3448-page-title
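As a rough sketch of how that tail match could drive the whole rewrite, the Python below anchors the old directory prefix and reuses the captured tail under the new prefix. The two prefixes come from the example URLs in the question; the names OLD_PREFIX, NEW_PREFIX and rewrite are made up for illustration, and each cluster would need its own pair.

```python
import re

# Prefixes taken from the single example in the question (one cluster only).
OLD_PREFIX = "www.domain.com/property-details/condominiums/"
NEW_PREFIX = "www.domain.com/bangkok/condos/rent/"

# ([^/]+)$ captures everything after the last slash, i.e. the shared tail.
CLUSTER = re.compile("^" + re.escape(OLD_PREFIX) + r"([^/]+)$")

def rewrite(url):
    """Return the new URL, or the input unchanged if it is not in this cluster."""
    return CLUSTER.sub(lambda m: NEW_PREFIX + m.group(1), url)

print(rewrite("www.domain.com/property-details/condominiums/3448-page-title"))
```

URLs outside the cluster pass through untouched, so the same function can be run over the full list of old URLs.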

Related

IIS URL rewrite not working properly 404 errors

I am upgrading a Joomla website hosted on IIS 10. I have oldsite.com and newsite.com. The new site has a slightly different folder structure, but the page names and content are the same. Understandably, the client doesn't want to lose SEO ranking on the old pages and wants to redirect them to the correct pages on the new, upgraded site.
I need to do the following.
Here * is a wildcard that stands for whatever is typed in its place in the URL:
/div-services/* will redirect to /div/*
/div-questions/* will redirect to /div/questions/*
/fm-lw-services/* will redirect to /fm-lw/*
/locations/* will redirect to /contact/*
/resources/blog/* will redirect to /blog/*
/contact-us/* will redirect to /contact/*
I initially set up my pattern as
(.*)(div-services)(.*) becomes {R:1}( div){R:1}
It worked well until the matching phrase was repeated somewhere in the URL: when "div-services" appears again, it gets replaced as well.
For example, if the URL is newsite.com/div-services/xyz/abc-div-services, the rule replaces both occurrences of "div-services", which is not desired; I only need to replace the first occurrence. I thought it would be an easy fix and changed my pattern to the following:
(.*)(/div-services/)(.*) becomes {R:1}(/div/){R:1}
Even though it validates successfully in the pattern tester, it doesn't work and does not rewrite the URL. I even tried escaping the slashes:
(.*)(\/div-services\/)(.*) becomes {R:1}(/div/){R:1}
Still no luck. After a lot of digging I found the following example:
div-services/(.*)$ becomes div/{R:1}
This worked well in general, but it fails when the URL has no trailing forward slash.
For example, newsite.com/div-services won't work, but newsite.com/div-services/ and newsite.com/div-services/xyx work fine.
I am at a loss; any help will be much appreciated. I just don't understand why I can't detect the forward slash /.
FYI, I figured out why this was not working: (.*)(/div-services/)(.*) becomes {R:1}(/div/){R:1}
It is because the input starts after the first forward slash. I was assuming it would be the whole URL, which is why my regular expression validated in the tester but didn't work in practice: when the rule runs, it only sees the part of the URL after the first slash. That clarifies a lot and explains why many of my patterns were not working even though they passed the pattern test. Hopefully this saves others the hours and hours I wasted because I didn't clearly understand how it worked.
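To make that finding concrete, here is a small Python sketch (not IIS configuration) that strips the leading slash before matching, as the rule input described above does, and makes the trailing slash and remainder optional so that /div-services, /div-services/ and /div-services/xyx all redirect. The RULES table and redirect_target function are hypothetical names; the prefix pairs are the ones listed in the question. Whether the optional-group form behaves identically inside an IIS rule has not been tested here.

```python
import re
from typing import Optional

# Patterns are written against the path WITHOUT its leading slash.
# (?:/(.*))? makes "/rest-of-path" optional, so the bare prefix also matches.
RULES = [
    (re.compile(r"^div-services(?:/(.*))?$"), "div/"),
    (re.compile(r"^div-questions(?:/(.*))?$"), "div/questions/"),
    (re.compile(r"^fm-lw-services(?:/(.*))?$"), "fm-lw/"),
    (re.compile(r"^locations(?:/(.*))?$"), "contact/"),
    (re.compile(r"^resources/blog(?:/(.*))?$"), "blog/"),
    (re.compile(r"^contact-us(?:/(.*))?$"), "contact/"),
]

def redirect_target(path: str) -> Optional[str]:
    """Return the new path for an old one, or None if no rule applies."""
    rule_input = path.lstrip("/")  # mimic the input without the first slash
    for pattern, new_prefix in RULES:
        m = pattern.match(rule_input)
        if m:
            return "/" + new_prefix + (m.group(1) or "")
    return None

print(redirect_target("/div-services/xyz/abc-div-services"))
```

Because each pattern is anchored with ^, only the leading "div-services" is replaced; a second occurrence later in the path is carried over unchanged.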

KimonoLabs crawler Generated URL List with regex

So, I'm trying to crawl a website that has like 7,000 product pages and the link structure is like this:
https://example.com/category/sub-category/numericid-name-of-the-product/
What I'm trying to achieve is to Generate a URL list, the Kimono App has that option, and it actually sections the URL but I'm only offered default value, range, and custom list.
I tried to put in stuff like "/.+/" to match all the chars, but that does not work, and I couldn't find any help on that in the official KB.
I know that import.io had that "{alpahnumeric}", for example, for different parts of the URL so it matches them. Is there a way to accomplish that in the KimonoLabs app?
Try this regex: https://example.com/([^/]+)/([^/]+)/([0-9]+)-([^/]+)
Note: you may need to escape some characters (namely / would be escaped as \/).
Also, I'm not familiar with KimonoLabs, so I don't know if this is what you're looking for exactly. Feel free to clarify.
Explanation
https://example.com/ literally
([^/]+)/ a bunch of not /s, followed by a /
([0-9]+)-([^/]+) Numbers followed by another bunch of not /s
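To see what each group captures, here is a quick Python check of the same pattern. The product URL is invented to fit the structure in the question, and the dots in the host are escaped here, which the pattern above does not do. This only illustrates the regex itself; it says nothing about what the Kimono app accepts.

```python
import re

PRODUCT_URL = re.compile(r"https://example\.com/([^/]+)/([^/]+)/([0-9]+)-([^/]+)")

# Hypothetical product URL following category/sub-category/numericid-name.
url = "https://example.com/tools/saws/58-sierra-saw/"

m = PRODUCT_URL.match(url)
category, sub_category, product_id, name = m.groups()
print(category, sub_category, product_id, name)
```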

301 Redirects using RegEx

I'm not great with RegEx. I have an ecommerce site moving from PD Shop to WooCommerce. I need to write 301s so that the pages on the old site redirect to their corresponding pages on the new site. The problem is that the URL structure for site A is completely different from the one for site B. Rather than doing it manually for thousands of products, I wanted to use RegEx, but I'm not even sure it can be done.
If anyone has any insight on how to pull this off, I'd really appreciate the help. I'd prefer not to do it one link at a time, but I can't see how.
Old links are structured like this:
www.domain.com/shop/item.aspx/item-name/id/
Examples:
www.domain.com/shop/item.aspx/sierra-saw/58/
www.domain.com/shop/item.aspx/duffle-bag-double-strap-olive/2206/
www.domain.com/shop/item.aspx/duffle-bag-side-zipper-black/2207/
New links are structured like this:
www.domain.com/product/item-name/
Examples:
www.domain.com/product/sierra-saw/
www.domain.com/product/double-strap-duffle-bag/
www.domain.com/product/double-strap-duffle-bag/
You should match www.domain.com/shop/item.aspx/([^/]+)/.* and replace it with www.domain.com/product/\1/.
The matching pattern matches URLs starting with the common root (www.domain.com/shop/item.aspx/), captures the next path fragment (everything up to the next slash) in a group, and matches the rest of the line.
The replacement just repeats the captured path fragment after the new common root.
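A minimal Python sketch of that match-and-replace, using the example URLs from the question (the function name to_new is made up, and the dots are escaped here for strictness):

```python
import re

# Common old root, one captured path fragment, then the rest of the line.
OLD = re.compile(r"www\.domain\.com/shop/item\.aspx/([^/]+)/.*")

def to_new(url):
    """Rewrite an old item URL to the new product URL; others pass through."""
    return OLD.sub(r"www.domain.com/product/\1/", url)

print(to_new("www.domain.com/shop/item.aspx/sierra-saw/58/"))
print(to_new("www.domain.com/shop/item.aspx/duffle-bag-double-strap-olive/2206/"))
```

Note that the second call yields www.domain.com/product/duffle-bag-double-strap-olive/, while the question lists double-strap-duffle-bag as the new name. Where the item name itself changed between sites, a regex alone cannot know the new name, so those items would presumably need an explicit old-to-new lookup.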

Help convert Apache rewrite rules to PHP regular expressions

Short story: I am using this technique to auto-version my css and js files by adding a string to the filename with filemtime():
http://w-shadow.com/blog/2012/07/30/automatic-versioning-of-css-js/
I got it up and running perfectly on my local machine (MAMP), but I use WP Engine for my hosting and they are set up on nginx and don't support .htaccess rewrite rules.
They do have a place to enter PHP regular expressions (preg_replace), though, and their instructions look like this:
HTML Post-Processing
A mapping of PHP regular expressions to replacement values which are executed on all blog HTML after WordPress finishes emitting the entire page. The pattern and replacement behavior is in the manner of preg_replace().
The following example removes all HTML comments in the first pattern, and causes a favicon (with any filename extension) to be loaded from another domain in the second pattern:
#<!--.*?-->#s =>
#\bsrc="/(favicon\..*)"# => src="http://mycdn.somewhere.com/$1"
So I'm wondering how hard it is to convert this rewrite rule to a PHP regular expression:
RewriteRule ^(.*)\.[\d]{10}\.(css|js)$ $1.$2 [L]
And whether this would even do the same thing as the Apache rewrite. The whole point of the technique is to bust the browser cache for CSS or JS files any time they are changed, but without resorting to query strings, which have various drawbacks.
Actually, it's pretty much the same. Take your regex, delimit it, drop it in a string and escape the right things, then take your rewrite rule and use single quotes to make it a string, and you're done. In your example:
$newUrl = preg_replace('/^(.*)\\.[\\d]{10}\\.(css|js)$/', '$1.$2', $url);
This will properly rewrite any URL you give it. However, it sounds like these preg_replaces are being run across a large document, which means your regex there won't do what you think it will. That, however, is a completely separate question, and one I won't even guess at, because I don't know what your requirements are. If you need help crafting the regex, please open another question with your specific requirements.
Also: next time, check the documentation.
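The difference between rewriting a single URL and processing a whole page can be seen with the same pattern in Python (a sketch only; the file name and timestamp are invented, and strip_version is a hypothetical helper). The ^ and $ anchors require the entire input to be the URL, so the pattern leaves a full HTML string untouched.

```python
import re

# Same pattern as the RewriteRule: name, 10-digit timestamp, css/js extension.
VERSIONED = re.compile(r"^(.*)\.[\d]{10}\.(css|js)$")

def strip_version(text):
    """Drop the 10-digit version segment when the whole input is such a URL."""
    return VERSIONED.sub(r"\1.\2", text)

# A lone URL is rewritten.
print(strip_version("/css/style.1343647234.css"))

# The same URL embedded in HTML is not: the anchors no longer line up.
html = '<link rel="stylesheet" href="/css/style.1343647234.css">'
print(strip_version(html) == html)
```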

Writing Regular Expression for URL in Google Analytics

I have a huge list of URL's, in the format:
http://www.example.com/dest/uk/bath/
http://www.example.com/dest/aus/sydney/
http://www.example.com/dest/aus/
http://www.example.com/dest/uk/
http://www.example.com/dest/nor/
What RegEx could I use to get the last three URL's, but miss the first two, so that every URL without a city attached is given, but the ones with cities are denied?
Note: I am using Google Analytics, so I need to use RegEx's to monitor my URL's with their advanced feature. As of right now Google is rejecting each regular expression.
Generally, the best suggestion I can make for parsing URLs with a regex is: don't.
Your time is much, much better spent finding a library for your language that is dedicated to processing URLs.
It will have worked out all the edge cases, be fully RFC compliant, be bug free, secure, and have a great user interface so you can just suck out the bits you really want.
In your case, the suggested approach would be to use your URL library to extract the elements and then work explicitly on them.
That way, at most you'll have to deal with the path on its own, and not have to worry so much whether it's
http://site.com/
https://site.com/
http://site.com:80/
http://www.site.com/
Unless you really want to.
For the "Path" you might even wish to use a splitter ( or a dedicated path parser ) to tokenise the path into elements first just to be sure.
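As an illustration of that approach outside Google Analytics, here is a short Python sketch using the standard library's urllib.parse. The function name is_country_page is made up, and the rule "exactly two path segments, the first being dest" is an assumption read off the example URLs in the question.

```python
from urllib.parse import urlsplit

def is_country_page(url: str) -> bool:
    """True for /dest/<country>/ URLs, False when a city segment follows."""
    # Split the path into non-empty segments, ignoring scheme, host and port.
    segments = [s for s in urlsplit(url).path.split("/") if s]
    return len(segments) == 2 and segments[0] == "dest"

for u in [
    "http://www.example.com/dest/uk/bath/",
    "http://www.example.com/dest/aus/",
    "https://example.com:80/dest/nor",
]:
    print(u, is_country_page(u))
```

Because only the path is inspected, the scheme, host, port and trailing slash variations listed above stop mattering.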
tj111's current solution doesn't work - it matches all your urls.
Here's one that works (and I checked with your values). It also matches, no matter if there is a trailing slash or not:
http:\/\/.*dest\/\w+/?$
/http:\/\/www\.site\.com\/dest\/\w+\/?$/i
This matches if they're all on the same site with "dest" there. You could also do this:
/\w+:\/\/[^/]+\/dest\/\w+\/?$/i
which will match any site with any protocol (http, ftp) and any URL with /dest/country at the end, plus an optional /.
Note that this will only work with a subset of what the URLs could legitimately be.
Try this regular expression:
^http://www\.example\.com/dest/[^/]+/$
This would only match the last three URLs.
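A quick way to check that claim against the five URLs from the question is to run the pattern in Python (this verifies the regex itself, not how Google Analytics applies it):

```python
import re

COUNTRY_ONLY = re.compile(r"^http://www\.example\.com/dest/[^/]+/$")

urls = [
    "http://www.example.com/dest/uk/bath/",
    "http://www.example.com/dest/aus/sydney/",
    "http://www.example.com/dest/aus/",
    "http://www.example.com/dest/uk/",
    "http://www.example.com/dest/nor/",
]

# [^/]+ cannot cross a slash, so a city segment makes the match fail.
matches = [u for u in urls if COUNTRY_ONLY.search(u)]
print(matches)
```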