Regex-specific question about my website's search function and broken links

I've been trying to figure out my regex pattern, but it doesn't seem to be working.
Here's what I'm trying to do:
I have broken links on my website when someone accidentally lands on a page like this:
https://example.com/catalogsearch/result/?q=
or
https://example.com/catalogsearch/result/
So I'm redirecting them back to my homepage. The problem is that the search now sends everything back to the homepage. I assume that if there is something after the equals sign, the search needs to continue as normal, for example:
https://example.com/catalogsearch/result/?q=person
but I can't figure out how to do this.
Here is the regex I've been working on for quite some time now. Either it is still wrong, or something else is wrong with my search:
"^/catalogsearch/result((/)|(/\\?)|(/\\?[a-z])|(/\\?[a-z]=))?$"
Please forgive me, I'm horrible with regex.

After a lot of discussion, we concluded that routes.yaml matches a route against the URL path only, not the query string. So, of the two examples in the post, the one without a query string can be handled with:
"/catalogsearch/result": { to: "https://example.com/", prefix: false }
For the other one (the empty ?q=), change the nginx config to redirect to the homepage or, if that is not possible, check with Magento support on how to take the query string into account in the routes.yaml file.
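The decision the redirect has to make can be sketched outside of routes.yaml or nginx. The following is only an illustration in Python, not a Magento or nginx configuration; the function name should_redirect_home is hypothetical. It sends a request home only when the path is the search-result path and the q parameter is missing or empty:

```python
from urllib.parse import urlsplit, parse_qs

SEARCH_PATH = "/catalogsearch/result"  # path taken from the question


def should_redirect_home(url):
    """Return True when a search-result URL carries no usable q= term."""
    parts = urlsplit(url)
    if parts.path.rstrip("/") != SEARCH_PATH:
        return False  # not a search-result URL at all
    # parse_qs drops blank values by default, so "?q=" yields no "q" key.
    terms = parse_qs(parts.query).get("q", [])
    return not any(term.strip() for term in terms)


print(should_redirect_home("https://example.com/catalogsearch/result/?q="))
print(should_redirect_home("https://example.com/catalogsearch/result/?q=person"))
```

The point is that the check has to look at the query string, which a path-only regex such as the one in the question cannot see.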

Related

Regex to replace spam links in WordPress

I am dealing with old hacked WordPress sites where spam links have been injected on images.
I have access to the database and would like to remove links that look like this:
<a style="text-decoration:none" href="/ansaid-retail-cost">.</a>
The text inside the href attribute varies (it might be for Cialis or any other product), but the rest doesn't vary. I want to remove the entire link, so the result is a single space.
I don't know regex, so I would appreciate the help. I've tried online generators, but they don't seem to work.
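No answer was posted for this one. As a minimal sketch, assuming the injected links always have exactly the shape shown above (same style attribute, a quoted href, and a single "." as link text), a pattern like the following would replace each one with a single space. It is written in Python for illustration; the name strip_spam_links is hypothetical, and the pattern should be tried on a copy of the data first:

```python
import re

# Only the href value is allowed to vary; everything else is matched literally.
SPAM_LINK = re.compile(r'<a style="text-decoration:none" href="[^"]*">\.</a>')


def strip_spam_links(html):
    """Replace every injected spam link with a single space."""
    return SPAM_LINK.sub(" ", html)


sample = 'photo<a style="text-decoration:none" href="/ansaid-retail-cost">.</a>caption'
print(strip_spam_links(sample))
```

Because the link text must be a literal ".", ordinary links with real anchor text are left alone.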

Use RegEx to redirect using data from files

Recently, we restructured a large site for one of our customers. This moved all the news articles on that site to a different place. The problem is that the Google cache still shows them at the old location, leading to A LOT of 404 Not Found errors (it's about 1400 news entries).
Normally, a redirect using a fairly simple regex would be fine, but not only did the path to the news change, some parameters changed as well. Example:
Old Url:
http://www.customers-url.com/old/path/to/the/news/details/?tx_ttnews%5Btt_news%5D=67&cHash=a782f3027c4d00face6df122a76c38ed
What the new URL should look like:
http://www.customers-url.com/new/path/to/news/?tx_news_pi1%5Bnews%5D=65
As you can see, the ID value changed from 67 to 65, and the part of the URL before the ? changed as well. Also, tx_ttnews has changed to tx_news_pi1, tt_news has changed to news, and the &cHash part has been dropped completely.
I have the changed ids in a csv in the following format:
old_id, new_id
1,2
2,3
...etc...
The same goes for the changed URL part before the ?. Since I'm not exactly an expert in regex, my question is:
Can this be done inside the .htaccess using regex (I'm not sure it can even use a file as input)? How complicated is that? And what would such a regular expression look like?
Rather than trying to use .htaccess, it would be easier to manage and easier to code if you simply make a new page that responds on the old URL (/old/path/to/the/news/details), then have that page build the new URL and return a 301 to the browser with that new URL.
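The URL-building step of that page could look roughly like this. It is a sketch in Python rather than whatever language the site actually uses; the function names load_id_map and build_new_url are hypothetical, and the 67,65 row is assumed from the example URLs in the question rather than taken from the real CSV:

```python
import csv
import io
from urllib.parse import urlsplit, parse_qs, urlencode

NEW_BASE = "http://www.customers-url.com/new/path/to/news/"  # from the question


def load_id_map(csv_text):
    """Read 'old_id, new_id' rows into a dict, skipping the header line."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # header: old_id, new_id
    return {row[0].strip(): row[1].strip() for row in reader if len(row) == 2}


def build_new_url(old_url, id_map):
    """Return the redirect target for an old news URL, or None if unknown."""
    query = parse_qs(urlsplit(old_url).query)  # decodes %5B and %5D to [ and ]
    old_ids = query.get("tx_ttnews[tt_news]")
    if not old_ids or old_ids[0] not in id_map:
        return None  # caller should answer with a 404 instead of a 301
    # urlencode re-encodes the brackets; cHash is simply not carried over.
    return NEW_BASE + "?" + urlencode({"tx_news_pi1[news]": id_map[old_ids[0]]})


id_map = load_id_map("old_id, new_id\n1,2\n2,3\n67,65\n")
old = ("http://www.customers-url.com/old/path/to/the/news/details/"
       "?tx_ttnews%5Btt_news%5D=67&cHash=a782f3027c4d00face6df122a76c38ed")
print(build_new_url(old, id_map))
```

The page would then send the returned URL in the Location header of a 301 response.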

content empty when using scrapy

Thanks to everyone in advance.
I encountered a problem when using Scrapy on Python 2.7.
The webpage I tried to crawl is a discussion board for the Chinese stock market.
When I try to get the first number, "42177", just under the banner of this page, I always get empty content. (The number you see on the webpage may not be the number in the picture shown here, because it is the number of times the article has been read and is updated in real time.) I am aware that this might be a dynamic content issue, but I don't yet have a clue how to crawl it properly.
The code I used is:
item["read"] = info.xpath("div[@id='zwmbti']/div[@id='zwmbtilr']/span[@class='tc1']/text()").extract()
I think the XPath is set correctly. I have checked the return value of this response, and it indeed tells me that there is nothing under this path. The result is shown here: 'read': [u'<div id="zwmbtilr"></div>']
If it had content, there would be something between <div id="zwmbtilr"> and </div>.
I would really appreciate it if you could share any thoughts on this!
I just opened your link in Firefox with NoScript enabled. There is nothing inside the <div id="zwmbtilr"></div>. If I enable JavaScript, I can see the content you want. So, as you already knew, it is a dynamic content issue.
Your first option is to try to identify the request generated by JavaScript. If you can do that, you can send the same request from Scrapy. If you can't, the next option is usually to use a package with JavaScript/browser emulation, such as ScrapyJS or Scrapy + Selenium.
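The difference between what Scrapy downloads and what the browser shows can be reproduced offline. The sketch below uses Python's standard xml.etree.ElementTree instead of Scrapy selectors so that it runs without extra packages, and both markup strings are hypothetical stand-ins for the real page (before and after its scripts run):

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: what the server sends before any JavaScript runs.
RAW_HTML = "<body><div id='zwmbti'><div id='zwmbtilr'></div></div></body>"
# Hypothetical markup: the same fragment after the browser has run the scripts.
RENDERED_HTML = (
    "<body><div id='zwmbti'><div id='zwmbtilr'>"
    "<span class='tc1'>42177</span></div></div></body>"
)

# Same path as in the question, minus text(), which ElementTree does not support.
XPATH = "div[@id='zwmbti']/div[@id='zwmbtilr']/span[@class='tc1']"


def read_counts(markup):
    """Return the text of every matching span, relative to the root element."""
    root = ET.fromstring(markup)
    return [span.text for span in root.findall(XPATH)]


raw_counts = read_counts(RAW_HTML)
rendered_counts = read_counts(RENDERED_HTML)
print(raw_counts)
print(rendered_counts)
```

The same path finds nothing in the raw markup and one value in the rendered markup, which is why the fix is to obtain the rendered data (or the request behind it) rather than to change the XPath.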

URL redirect plugin regex input for match and target

I'm panicking a little, so sorry if I haven't explained this well enough.
I've been through quite the nightmare of a permalink restructuring.
Old permalink= sitename/archives/postid
desired new= sitename/postname
I've tried everything, it seems. I've even dabbled with /?p=$1 (<-----that nonsense!). But now I'm getting some crazy error when I go to my old permalink structure that reads:
Oops! Google Chrome could not connect to 0.0.37.89
Suggestions:
Try reloading: 0.0.37.89
and this was supposed to be "redirected".
I give up. Please help.
sitename= brightontheday.com
I used the Redirection plugin to redirect all old permalinks (/archives/postID) to the new permalink (/postID/postname).
Also, the issue appeared to be due to caching via Cloudflare. It's important to note that one should put Cloudflare in "Development Mode" while making site-wide changes.
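For reference, the match and target of such a rule can be sketched as follows. This is plain Python, not the plugin's own configuration format, and the name redirect_target is hypothetical. It assumes the target /?p=ID, which WordPress normally resolves to the post's current permalink. The last line shows one possible reading of the error above, offered only as a guess: a bare number in the host position is displayed as an IPv4 address, and 9561 comes out as 0.0.37.89.

```python
import ipaddress
import re

# Old structure from the question: /archives/<post id>, optional trailing slash.
OLD_PERMALINK = re.compile(r"^/archives/(\d+)/?$")


def redirect_target(path):
    """Map an old /archives/ID path to /?p=ID, or return None if it is not one."""
    match = OLD_PERMALINK.match(path)
    if match is None:
        return None
    return "/?p=" + match.group(1)


print(redirect_target("/archives/9561"))
# A number used as a host name is read as a 32-bit IPv4 address.
print(ipaddress.IPv4Address(9561))
```

If that guess is right, the failing redirect target was missing its leading path, so the browser treated the post ID as the host.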

URL routing works on one website but not another: 100% processor usage

I thought I had URL routing under control, since it works well on one website, but I found out it does not work properly on another. Both websites run on the same server, with the same IIS 6.0 and the same asp_isapi.dll.
SETUP 1: This one works well:
routes.MapPageRoute("Article SEO",
    "sid/{sid}",
    "~/ar.aspx",
    true,
    new RouteValueDictionary { },
    new RouteValueDictionary { { "sid", @"^([a-zA-Z\-]*)+([a-zA-Z])+(.html)$" } }
);
SETUP 2: But this one, although very similar, is not working well:
routes.MapPageRoute("Article",
    "page/{sid}",
    "~/page.aspx",
    true,
    new RouteValueDictionary { },
    new RouteValueDictionary { { "sid", @"^([a-zA-Z0-9\-]*)+([a-zA-Z0-9])+(.html)$" } }
);
Testing the regexes in the Regex Coach shows that they are written correctly: both accept valid strings and reject invalid ones.
The URL I use for the second one is http://address/page/some-html-keywords.html. If I specify the URL like this, it works well.
The problem is that if I change the .html extension to something like .htmls or .anything, it completely kills the web server: I get 100% processor usage. I don't understand why or how, and I don't have this problem with the first setup. There I can change the extension to whatever I want, and it either shows the page because the format is correct or shows a 404 page not found.
Some examples:
http://address/page - 404 page, working correctly
http://address/page/test.html - accepted, working correctly
http://address/page/testing_#.html - 404 page, working correctly
http://address/page/test.htmls - won't show page, hanging, 100% processor usage, not working correctly
http://address/page/test.whatever - won't show page, hanging, 100% processor usage, not working correctly
http://address/page/page.aspx - redirects, working correctly
The same setup (with a different regex check) works well on another website within the same IIS 6.0. Both use the same asp_isapi.dll file.
I just don't get it. I have tried commenting out all the code in page.aspx to find out whether the problem was in page.aspx itself, but it makes no difference. It simply hangs with an empty page as well. So the problem must be in URL routing, isapi.dll, or IIS. But the other website on the same IIS and the same machine simply works.
Any opinions?
Thank you
Fero
I don't know anything about URL routing, but I notice that the regular expression you specify
@"^([a-zA-Z0-9\-]*)+([a-zA-Z0-9])+(.html)$"
is nearly the same in both code samples (the second only adds 0-9) and, in both, ends with a trailing $ (end of string), which prevents anything NOT ending in .html from matching. To match .htmls you need (.html.*)$; to match .anything you need something like
@"^([a-zA-Z0-9\-]*)+([a-zA-Z0-9])+\.[a-zA-Z0-9]*$"
Also, it would probably be a good idea to escape the '.' just before html, as in \.html, because regular expressions normally treat '.' as any single character, which includes the '.' character itself.
I hope this helps.
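One thing the thread does not confirm, but which may be worth checking: a group that already contains * and is itself repeated with +, as in ([a-zA-Z0-9\-]*)+, gives a backtracking regex engine many equivalent ways to split the same text, and it tries them all when the overall match fails (for example on .htmls). That can be a source of high CPU on non-matching input. The sketch below, in Python rather than C#, shows a constraint that accepts the same kind of names without the nested repetition and with the dot escaped; the name is_valid_sid is hypothetical:

```python
import re

# Any run of letters, digits or hyphens, ending in a letter or digit, then ".html".
# No repeated group contains another repetition, so a failed match stays cheap.
SID_PATTERN = re.compile(r"[a-zA-Z0-9-]*[a-zA-Z0-9]\.html")


def is_valid_sid(sid):
    """Return True only when the whole route value fits the pattern."""
    return SID_PATTERN.fullmatch(sid) is not None


for candidate in ("test.html", "test.htmls", "test.whatever", "testing_#.html"):
    print(candidate, is_valid_sid(candidate))
```

The same pattern text can be tried as the route constraint in SETUP 2 to see whether the hang on .htmls goes away.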
Learn how to analyze the root cause of high CPU usage in ASP.NET, and then you will find out why:
http://blogs.msdn.com/b/tess/archive/2008/02/22/net-debugging-demos-lab-4-high-cpu-hang.aspx