Use RegEx to redirect using data from files

Recently, we restructured a large site for one of our customers. This moved all the news articles on that site to a different location. The problem is that the Google cache still shows them at the old location, leading to a lot of 404 Not Found errors (about 1400 news entries).
Normally, a redirect using a fairly simple regex would be fine, but not only did the path to the news change, some parameters changed as well. Example:
Old Url:
http://www.customers-url.com/old/path/to/the/news/details/?tx_ttnews%5Btt_news%5D=67&cHash=a782f3027c4d00face6df122a76c38ed
What the new URL should look like:
http://www.customers-url.com/new/path/to/news/?tx_news_pi1%5Bnews%5D=65
As you can see, the news ID changed from 67 to 65, and the part of the URL before the ? changed as well. Also, tx_ttnews has changed to tx_news_pi1, tt_news has changed to news, and the &cHash part has been dropped completely.
I have the changed IDs in a CSV file in the following format:
old_id, new_id
1,2
2,3
...etc...
The same goes for the changed part of the URL before the ?. Since I'm not exactly an expert in regex, my question is:
Can this be done inside the .htaccess using RegEx (I'm not sure whether it can even use a file as input)? How complicated is that? And what would such a regular expression look like?

Rather than trying to use .htaccess, it would be easier to manage and easier to code if you simply make a new page that responds on the old url (/old/path/to/the/news/details), then make that page build the new url and return a 301 to the browser with that new url.
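As a rough sketch of what such a redirect page would need to compute, the Python below loads the old-to-new ID mapping from CSV text and builds the new URL from an old one. The function names and the `NEW_PATH` constant are made up for illustration, and it assumes every old URL maps to the single new path shown in the question; a real site would need a second lookup for the changed path part.

```python
import csv
import io
from urllib.parse import urlsplit, parse_qs, urlencode

# Assumption: all news now lives under this one path (taken from the example URL).
NEW_PATH = "/new/path/to/news/"


def load_id_map(csv_text):
    """Parse 'old_id, new_id' rows into a dict, skipping the header row."""
    reader = csv.reader(io.StringIO(csv_text), skipinitialspace=True)
    next(reader, None)  # skip the "old_id, new_id" header
    return {row[0].strip(): row[1].strip() for row in reader if len(row) == 2}


def build_redirect(old_url, id_map):
    """Return the new URL for an old news URL, or None if the ID is unknown."""
    parts = urlsplit(old_url)
    # parse_qs decodes %5B/%5D, so the key appears with literal brackets.
    old_ids = parse_qs(parts.query).get("tx_ttnews[tt_news]")
    if not old_ids or old_ids[0] not in id_map:
        return None
    # urlencode re-encodes the brackets; cHash is simply not carried over.
    new_query = urlencode({"tx_news_pi1[news]": id_map[old_ids[0]]})
    return f"{parts.scheme}://{parts.netloc}{NEW_PATH}?{new_query}"
```

The page would send the returned URL in the Location header of a 301 response, and fall back to a 404 (or the news overview) when the function returns None.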

Related

Regex specific question and search function on my website dealing with broken links

I've been trying to figure out my regex pattern, but it doesn't seem to be working for me.
Here's what I'm trying to do:
I have broken links on my website if someone accidentally gets to a page like this:
https://example.com/catalogsearch/result/?q=
or
https://example.com/catalogsearch/result/
So I'm redirecting them back to my homepage. The problem is that the search now sends everything back to the homepage. I assume that if there is something after the equals sign, the search needs to continue, as in
https://example.com/catalogsearch/result/?q=person
but currently I can't figure this out.
Here is the regex that I've been messing with for quite some time now. It still seems to be wrong, or something else is wrong with my search.
"^/catalogsearch/result((/)|(/\\?)|(/\\?[a-z])|(/\\?[a-z]=))?$"
Please forgive me, I'm horrible with regex.
After a lot of discussion, the conclusion is that routes.yaml matches on the URL path but not on the query string. Hence, of the two examples in the post, you can use
"/catalogsearch/result": { to: "https://example.com/", prefix: false }
For the other one, change the nginx config to redirect to the homepage or, if that is not possible, check with Magento support on how to handle the query string in the routes.yaml file.
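The decision the question is after (send the visitor home only when there is no search term) can be sketched in Python as below. This is only an illustration of the logic, not routes.yaml or nginx syntax, and `should_redirect_home` is a hypothetical helper name.

```python
from urllib.parse import urlsplit, parse_qs


def should_redirect_home(url):
    """True when the URL is the search result page with a missing or empty q."""
    parts = urlsplit(url)
    if parts.path.rstrip("/") != "/catalogsearch/result":
        return False
    # parse_qs drops blank values by default, so "?q=" yields no "q" entry.
    terms = parse_qs(parts.query).get("q", [])
    return not any(term.strip() for term in terms)
```

Checking the parsed query string rather than matching it with a regex avoids having to enumerate every shape of an "empty" search in the pattern.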

Alternative to <!--#include virtual="somefilename"-->

I have a website running on an old Apache server with SSI enabled. My host wants to move to a new server which has SSI disabled for security reasons.
I have a whole lot of pages with Google Friendly urls which just have one line
<!--#include virtual="Url_Including_Search_String"-->
What is the best alternative to the SSI to keep my google friendly search strings returning the specified search result?
I can achieve most of the results with rewrite rules in the .htaccess file, however some search strings have a space in the keyword but the url doesn't. I can't do this with a rewrite rule
ie www.somedomain.com.au/SYDNEY.htm would have
<!--#include virtual="/search.php?keyword=SYDNEY&Submit=SEARCH"-->
However, the issue is that
www.somedomain.com.au/POTTSPOINT.htm would have
<!--#include virtual="/search.php?keyword=POTTS+POINT&Submit=SEARCH"-->
A rewrite rule cannot detect where a space should be in a suburb name, so I am hoping there is an alternative to <!--#include virtual=
I have looked at RewriteMap but don't think I can access the file I would need to put this in.
I would use Mod Rewrite to redirect any calls to non-existent files to your Search page.
For example:
http://example.com/SYDNEY redirects to
http://example.com/search.php?q=SYDNEY
(assuming there is not actually a /SYDNEY/ file at your server root.)
Then get rid of all those individual redirect pages.
As for the spaces, I'd modify my actual Search page to recognize (for example) "POTTSPOINT" and figure out that the space should be inserted. Basically compare the search term against a database of substitutions.
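A minimal sketch of that substitution lookup, assuming a small in-memory table in place of the database. The table contents and function names are hypothetical; only the POTTSPOINT example comes from the question.

```python
from urllib.parse import urlencode

# Hypothetical substitution table; in practice this would live in a database.
SUBSTITUTIONS = {
    "POTTSPOINT": "POTTS POINT",
}


def expand_keyword(slug):
    """Map a space-free URL slug to the keyword the search expects."""
    key = slug.upper()
    return SUBSTITUTIONS.get(key, key)


def search_url(slug):
    """Build the search URL that the old SSI include pointed at."""
    params = {"keyword": expand_keyword(slug), "Submit": "SEARCH"}
    return "/search.php?" + urlencode(params)
```

Slugs without an entry in the table pass through unchanged, so single-word suburbs such as SYDNEY need no entry at all.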

301 redirect to correct url

I have a lot of incorrect backlinks and want to 301 redirect them to the correct URLs. The correct URLs look like this:
http://www.domain.com/string-video_string.html
The backlinks are pointing to:
http://www.domain.com/string_string.html
Is there any way to 301 redirect the wrong backlinks to the correct links?
Thank you in advance
You can use this rule in your site root .htaccess:
RedirectMatch 301 ^/([^_-]+)_(.+)$ /$1-video_$2
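To see what that pattern does, here is the same regex and substitution in Python (an illustration only; Apache applies the rule itself, and `fix_path` is a made-up name). Because the first group excludes both `_` and `-`, an already corrected path no longer matches, so the redirect does not loop.

```python
import re

# Same pattern as the RedirectMatch rule: text before the first "_" must
# contain neither "_" nor "-".
PATTERN = re.compile(r"^/([^_-]+)_(.+)$")


def fix_path(path):
    """Return the corrected path, or None if the rule would not fire."""
    match = PATTERN.match(path)
    if match is None:
        return None
    return f"/{match.group(1)}-video_{match.group(2)}"
```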
Depending on how you want to redirect (in which method; PHP, htaccess, etc.) you have some options.
I assume you're seeing 404 errors when users are trying to get to the links from an external source, like a search engine.
If that's the case, you can easily generate the code you need for whichever method you choose using this website:
http://www.rapidtables.com/web/tools/redirect-generator.htm
Make sure that you correctly format the URLs you want to redirect and it should work fine.
If you want to make sure your SEO issues get fixed, you should create a robots.txt file and place it in the root directory of your site (usually where the index file is), then follow the instructions from this site: http://tools.seobook.com/robots-txt/ to de-index the bad links from the search engine. You may also want to create and submit (or resubmit) XML sitemaps to the search engines your users use most.

What does this URL mean?

http://localhost/students/index.cfm/register?action=studentreg
I did not understand the use of 'register' after index.cfm. Can anyone please help me understand what it could mean? There is an index.cfm file in the students folder. Could register be a folder name?
They might be using special commands within their .htaccess files to modify the URL to point to something else.
Things like pointing home.html -> index.php?p=home
ColdFusion will execute index.cfm. It is up to the script to decide what to do with the /register that comes after.
This trick is used to build SEO-friendly URLs. For example, in http://www.ohnuts.com/buy.cfm/bulk-nuts-seeds/almonds/roasted-salted, buy.cfm uses the /bulk-nuts-seeds/almonds/roasted-salted part to determine which page to show.
What's nice about this is that it avoids custom 404 error handlers and URL rewrites. This makes it easier for your application to directly manage the URLs used.
I don't know if it works on all platforms, as I've only used it on IIS.
You want to look into the cgi.PATH_INFO variable; it is populated automatically by the CF server when this URL format is used.
A better real-life example would look something like this.
I have a URL which I want to make prettier:
http://mybikesite/index.cfm?category=bicycles&manufacturer=cannondale&model=trail-sl-4
I can rewrite it this way:
http://mybikesite/index.cfm/category/bicycles/manufacturer/cannondale/model/trail-sl-4
Our cgi.PATH_INFO value is: /category/bicycles/manufacturer/cannondale/model/trail-sl-4
We can parse it using list functions to get the same data that the original URL gives us automatically.
The second part of your URL is a plain GET variable; it is pushed into the URL scope as usual.
Both formats can be mixed; GET vars may be used for paging or any other secondary stuff.
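The parsing step can be sketched as follows. It is written in Python rather than CFML purely to show the logic; in ColdFusion the same thing would be done with list functions on cgi.PATH_INFO. The function name is hypothetical.

```python
def parse_path_info(path_info):
    """Split '/key1/value1/key2/value2' into a dict of key/value pairs."""
    segments = [segment for segment in path_info.split("/") if segment]
    # Pair even-indexed segments (keys) with odd-indexed ones (values);
    # a trailing key without a value is dropped by zip.
    return dict(zip(segments[0::2], segments[1::2]))
```

A single segment such as /register has no value to pair with, so the script would treat it as an action name instead of running it through this pairing.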
index.cfm is using either a CFIF IsDefined("register") or a CFIF #cgi.Path_Info# CONTAINS statement to execute a function or perform a logic step.

URL routing process works on one website, not another. 100% processor usage

I thought I had URL routing under control, as it works well on one website, but I found out it is not working properly on another. Both websites run from the same server, the same IIS 6.0, and the same asp_isapi.dll.
SETUP 1: This one works well:
routes.MapPageRoute("Article SEO",
    "sid/{sid}",
    "~/ar.aspx",
    true,
    new RouteValueDictionary { },
    new RouteValueDictionary { { "sid", @"^([a-zA-Z\-]*)+([a-zA-Z])+(.html)$" } }
);
SETUP 2: But this one, which is very similar, is not working well:
routes.MapPageRoute("Article",
    "page/{sid}",
    "~/page.aspx",
    true,
    new RouteValueDictionary { },
    new RouteValueDictionary { { "sid", @"^([a-zA-Z0-9\-]*)+([a-zA-Z0-9])+(.html)$" } }
);
Testing the regexes in the Regex Coach shows that they are written correctly; that is, they both accept valid strings and reject invalid ones.
The URL I use for the second one is http://address/page/some-html-keywords.html. If I specify the URL like this, it works well.
The problem is that if I change the .html extension to something like .htmls or .anything, it completely kills the web server: I get 100% processor usage. I don't understand why or how, as I don't have this problem with the first setup. There I can change it to whatever I want, and it either shows the page because the format is correct or shows a 404 Page Not Found.
Some examples:
http://address/page - 404 page, working correctly
http://address/page/test.html - accepted, working correctly
http://address/page/testing_#.html - 404 page, working correctly
http://address/page/test.htmls - won't show page, hanging, 100% processor usage, not working correctly
http://address/page/test.whatever - won't show page, hanging, 100% processor usage, not working correctly
http://address/page/page.aspx - redirects, working correctly
The same setup (with a different regex check) works well on another website within the same IIS 6.0. Both use the same asp_isapi.dll file.
I just don't get it. I have tried commenting out all the code in page.aspx to find out whether the problem was in page.aspx itself, but it makes no difference. It simply hangs with an empty page as well. So it must be a problem with URL routing, isapi.dll, or IIS. But the other website on the same IIS and the same machine simply works.
Any opinions?
Thank you
Fero
I don't know anything about URL routing
BUT I notice that the regular expression you specify
@"^([a-zA-Z0-9\-]*)+([a-zA-Z0-9])+(.html)$"
looks almost the same in both code samples (the second only adds 0-9 to the character classes) and, in both examples, ends with a trailing $ (which means end-of-string), which prevents anything NOT ending in .html from being matched by that regular expression. To match .htmls you need (.html.*)$; to match .anything you need something like
@"^([a-zA-Z0-9\-]*)+([a-zA-Z0-9])+\.[a-zA-Z0-9]*$"
Also, it would probably be a good idea to escape the '.' just before html, like \.html, as regular expressions normally process '.' to mean any single character, which includes the '.' character.
I hope this helps.
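A small sketch of the escaping advice, using Python's re module in place of the .NET route constraint. The pattern also flattens the nested quantifiers of the original, such as ([a-zA-Z0-9\-]*)+, into a single character class. Nested quantifiers are a well-known source of heavy backtracking on strings that fail to match, but whether that is what causes the 100% CPU here is an assumption this thread does not confirm. The names are hypothetical.

```python
import re

# Escaped dot, no nested quantifiers. Like the original, the name must end
# with a letter or digit (not a hyphen) right before ".html".
# Assumption: this is meant to accept the same URLs as the original constraint.
SID_PATTERN = re.compile(r"[a-zA-Z0-9-]*[a-zA-Z0-9]\.html")


def is_valid_sid(sid):
    """True when the whole string is '<name>.html' with an allowed name."""
    return SID_PATTERN.fullmatch(sid) is not None
```

With the dot escaped, a value such as testXhtml is rejected, whereas the unescaped (.html) group in the original would let any character stand in for the dot.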
Learn how to analyze the root cause of high CPU usage in ASP.NET, and then you will find out why:
http://blogs.msdn.com/b/tess/archive/2008/02/22/net-debugging-demos-lab-4-high-cpu-hang.aspx