Alternative to <!--#include virtual="somefilename"--> (SSI)

I have a website running on an old Apache server with SSI enabled. My host wants to move to a new server which has SSI disabled for security reasons.
I have a whole lot of pages with Google-friendly URLs which just contain one line:
<!--#include virtual="Url_Including_Search_String"-->
What is the best alternative to SSI to keep my Google-friendly URLs returning the specified search results?
I can achieve most of the results with rewrite rules in the .htaccess file; however, some search strings have a space in the keyword while the URL doesn't, and I can't handle that with a rewrite rule.
e.g. www.somedomain.com.au/SYDNEY.htm would have
<!--#include virtual="/search.php?keyword=SYDNEY&Submit=SEARCH"-->
However, the issue is that
www.somedomain.com.au/POTTSPOINT.htm would have
<!--#include virtual="/search.php?keyword=POTTS+POINT&Submit=SEARCH"-->
A rewrite rule cannot detect where a space should be in a suburb name, so I'm hoping there is an alternative to <!--#include virtual=...-->.
I have looked at RewriteMap but don't think I can access the file I would need to put this in.

I would use mod_rewrite to redirect any calls to non-existent files to your search page.
For example:
http://example.com/SYDNEY redirects to
http://example.com/search.php?q=SYDNEY
(assuming there is not actually a /SYDNEY/ file at your server root.)
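Something along these lines in the root .htaccess should do it (a sketch only; it assumes the suburb pages are uppercase letters with an optional .htm extension, and reuses the parameter names from your existing includes, so adjust to suit):
RewriteEngine On
# Only rewrite requests that don't correspond to a real file or directory
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
# /SYDNEY.htm (or /SYDNEY) -> /search.php?keyword=SYDNEY&Submit=SEARCH
RewriteRule ^([A-Z]+)(\.htm)?$ /search.php?keyword=$1&Submit=SEARCH [L,QSA]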
Then get rid of all those individual redirect pages.
As for the spaces, I'd modify my actual Search page to recognize (for example) "POTTSPOINT" and figure out that the space should be inserted. Basically compare the search term against a database of substitutions.
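A minimal sketch of that lookup inside search.php (the array entries and variable names here are just illustrations):
<?php
// Map the collapsed keyword from the URL back to the real suburb name
$substitutions = array(
    'POTTSPOINT'     => 'POTTS POINT',
    'RUSHCUTTERSBAY' => 'RUSHCUTTERS BAY',
    // ...one entry per multi-word suburb
);

$keyword = isset($_GET['keyword']) ? strtoupper($_GET['keyword']) : '';
if (isset($substitutions[$keyword])) {
    $keyword = $substitutions[$keyword];
}
// $keyword now holds e.g. "POTTS POINT" and can be handed to the existing search logic
?>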

Related

301 redirect to correct URL

I have a lot of incorrect backlinks and want to 301 redirect them to the correct ones. The correct URLs are as follows:
http://www.domain.com/string-video_string.html
The backlinks are pointing to:
http://www.domain.com/string_string.html
Is there any possible way to 301 redirect the wrong backlinks to the correct links?
Thank you in advance.
You can use this rule in your site root .htaccess:
RedirectMatch 301 ^/([^_-]+)_(.+)$ /$1-video_$2
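With that rule, a bad link such as http://www.domain.com/almonds_roasted.html (a made-up slug for illustration) would be 301-redirected to http://www.domain.com/almonds-video_roasted.html; the rule simply re-inserts "-video" before the first underscore.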
Depending on how you want to redirect (PHP, .htaccess, etc.), you have some options.
I assume you're seeing 404 errors when users are trying to get to the links from an external source, like a search engine.
If that's the case, you can easily generate the code you need for whichever method you choose using this website:
http://www.rapidtables.com/web/tools/redirect-generator.htm
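For instance, the PHP variant such a generator emits is essentially just a header() redirect, something like this (a generic sketch, not the tool's exact output):
<?php
// Permanent (301) redirect to the corrected URL (example target)
header("HTTP/1.1 301 Moved Permanently");
header("Location: http://www.domain.com/string-video_string.html");
exit;
?>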
Make sure that you correctly format the URLs you want to redirect and it should work fine.
If you want to make sure your SEO issues get fixed, create a robots.txt file and place it in the root directory of your site (usually where the index file is), and follow the instructions at http://tools.seobook.com/robots-txt/ to de-index the bad links from the search engines. You may also want to create and submit (or resubmit) XML sitemaps to the search engines your users use most.

Exclude URLs without 'www' from Nutch 1.7 crawl

I'm currently using Nutch 1.7 to crawl my domain. My issue is specific to URLs being indexed as www vs. non-www.
Specifically, after running the crawl, indexing to Solr 4.5, and then validating the results on the front end with AJAX Solr, the search results page lists results/pages with both 'www' and non-www URLs, such as:
www.mywebsite.com
mywebsite.com
www.mywebsite.com/page1.html
mywebsite.com/page1.html
My understanding is that the URL filtering (i.e. regex-urlfilter.txt) needs modification. Are there any regex/Nutch experts who could suggest a solution?
Here is the code on pastebin.
There are at least a couple of solutions.
1.) urlfilter-regex plugin
If you don't want to crawl the non-www pages at all, or else filter them at a later stage such as at index time, that is what the urlfilter-regex plugin is for. It lets you mark any URLs matching the regex patterns starting with "+" to be crawled. Anything that does not match a regex prefixed with a "+" will not be crawled. Additionally in case you want to specify a general pattern but exclude certain URLs, you can use a "-" prefix to specify URLs to subsequently exclude.
In your case you would use a rule like:
+^(https?://)?www\.
This will match anything that starts with:
https://www.
http://www.
www.
and therefore will only allow such URLs to be crawled.
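In other words, your regex-urlfilter.txt would end up looking roughly like this (note the final "-." rule, which excludes everything not explicitly allowed above it):
# Allow only URLs whose host starts with www (with or without the protocol present)
+^(https?://)?www\.
# Skip everything else
-.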
Given that the URLs listed were not being excluded by your regex-urlfilter, either the plugin wasn't turned on in your nutch-site.xml, or it is not pointed at that file.
In nutch-site.xml you have to specify regex-urlfilter in the list of plugins, e.g.:
<property>
<name>plugin.includes</name>
<value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-basic|query-(basic|site|url)|response-(json|xml)|urlnormalizer-(pass|regex|basic)</value>
</property>
Additionally, check that the property specifying which file to use is not overridden in nutch-site.xml and is correct in nutch-default.xml. It should be:
<property>
<name>urlfilter.regex.file</name>
<value>regex-urlfilter.txt</value>
<description>Name of file on CLASSPATH containing regular expressions
used by urlfilter-regex (RegexURLFilter) plugin.</description>
</property>
and regex-urlfilter.txt should be in the conf directory for nutch.
There is also the option to perform the filtering only at certain steps, e.g. at index time, if you only want to filter there.
2.) solrdedup command
If the URLs point to the exact same page, which I am guessing is the case here, they can be removed by running the nutch command to delete duplicates after crawling:
http://wiki.apache.org/nutch/bin/nutch%20solrdedup
This will use the digest values computed from the text of each indexed page to find any pages that were the same and delete all but one.
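Assuming a local Solr instance, the invocation is simply (the Solr URL is an example):
bin/nutch solrdedup http://localhost:8983/solr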
However, you would have to modify it to change which duplicate is kept if you want to specifically keep the "www" ones.
3.) Write a custom indexing filter plugin
You can write a plugin that reads the URL field of a Nutch document and converts it in any way you want before indexing. This would give you more flexibility than using an existing plugin like urlnormalizer-regex.
It is actually very easy to make plugins and add them to Nutch, which is one of the great things about it. As a starting point you can copy and look at one of the other plugins included with Nutch that implement IndexingFilter, such as the index-basic plugin.
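As a very rough illustration only (the filter signature here is assumed from the Nutch 1.x IndexingFilter interface and should be checked against your version; the plugin.xml and build wiring are omitted), such a filter might look like:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

// Sketch: force every indexed URL into its www form before it reaches Solr
public class WwwUrlIndexingFilter implements IndexingFilter {

  private Configuration conf;

  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) throws IndexingException {
    String original = url.toString();
    // Insert "www." after the protocol if it is missing
    String normalized = original.replaceFirst("^(https?://)(?!www\\.)", "$1www.");
    doc.removeField("url");
    doc.add("url", normalized);
    return doc;
  }

  public void setConf(Configuration conf) { this.conf = conf; }

  public Configuration getConf() { return conf; }
}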
You can also find a lot of examples:
http://wiki.apache.org/nutch/WritingPluginExample
http://sujitpal.blogspot.com/2009/07/nutch-custom-plugin-to-parse-and-add.html

Is there a "clean URL" (mod_rewrite) equivalent for iPlanet?

I'm working with ColdFusion (because I have to) and we use iPlanet 7 (because we have to), and I would like to pass clean URLs instead of the query-param junk (for numerous reasons). My problem is I don't have access to the overall obj.conf file, and was wondering if there were .htaccess equivalents I could pass on the fly per directory. Currently I am using Application.cfc to force the server to look at index.cfm in root before loading the requested page, but this requires that a .cfm file be passed, so it just 404s out if the user provides /path/to/file but no extension. Ultimately, I would like to allow the user to pass domain.com/path/to/file but serve domain.com/index.cfm?q1=path&q2=to&q3=file. Any ideas?
You can use mod_dir with the DirectoryIndex directive to set which page is served on /directory/ requests.
http://httpd.apache.org/docs/2.2/mod/mod_dir.html
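On Apache that is just a one-line directive, e.g.:
DirectoryIndex index.cfm index.html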
I'm not sure what exists for iPlanet; I haven't had to work with it before. But it would be possible to use a URL like index.cfm/path/to/file, and pull the extra path information via the cgi.path_info variable. Not exactly what you're looking for, but cleaner than query-params.

What does this URL mean?

http://localhost/students/index.cfm/register?action=studentreg
I did not understand the use of 'register' after index.cfm. Can anyone please help me understand what it could mean? There is an index.cfm file in the students folder. Could register be a folder name?
They might be using special commands within their .htaccess files to modify the URL to point to something else.
Things like pointing home.html -> index.php?p=home
ColdFusion will execute index.cfm. It is up to the script to decide what to do with the /register that comes after.
This trick is used to build SEO-friendly URLs. For example, at http://www.ohnuts.com/buy.cfm/bulk-nuts-seeds/almonds/roasted-salted, buy.cfm uses the /bulk-nuts-seeds/almonds/roasted-salted part to determine which page to show.
What's nice about this is that it avoids custom 404 error handlers and URL rewrites, which makes it easier for your application to directly manage the URLs used.
I don't know if it works on all platforms, as I've only used it on IIS.
You want to look into the cgi.PATH_INFO variable; it is populated automatically by the CF server when such a URL format is used.
A better real-life example would look something like this.
I have a URL which I want to make prettier:
http://mybikesite/index.cfm?category=bicycles&manufacturer=cannondale&model=trail-sl-4
I can rewrite it this way:
http://mybikesite/index.cfm/category/bicycles/manufacturer/cannondale/model/trail-sl-4
Our cgi.PATH_INFO value is: /category/bicycles/manufacturer/cannondale/model/trail-sl-4
We can parse it using list functions to get the same data as the original URL gives us automatically.
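A minimal parsing sketch (the variable names here are just examples):
<!--- cgi.path_info = "/category/bicycles/manufacturer/cannondale/model/trail-sl-4" --->
<cfset params = structNew()>
<cfset segments = listToArray(cgi.path_info, "/")>
<!--- Walk the segments as name/value pairs --->
<cfloop from="1" to="#arrayLen(segments) - 1#" step="2" index="i">
    <cfset params[segments[i]] = segments[i + 1]>
</cfloop>
<!--- params.category is "bicycles", params.manufacturer is "cannondale", params.model is "trail-sl-4" --->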
The second part of your URL is a plain GET variable; it is pushed into the URL scope as usual.
Both formats can be mixed; GET vars may be used for paging or any other secondary stuff.
index.cfm is using either a <cfif IsDefined("register")> or a <cfif cgi.path_info CONTAINS ...> statement to execute a function or perform a logic step.

Codeigniter Routes for filename with extension

I am using CodeIgniter and its routes system successfully with some lovely regexes; however, I have come unstuck on what should be an easy-peasy thing in the system.
I want to include a bunch of search engine related files (for Google webmaster etc.) plus the robots.txt file, all in a controller.
So, I have created the controller and updated the routes file, but I don't seem to be able to get it working with these files.
Here's a snip from my routes file:
$route['robots\.txt|LiveSearchSiteAuth\.xml'] = 'search_controller/files';
Within the function I use the URI helper to figure out which content to show.
Now I can't get this to match, which points to my regexp being wrong. I'm sure this is a really obvious one, but it's late and my caffeine tank is empty :)
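For reference, the files() method looks roughly like this (a simplified sketch; CodeIgniter 2.x syntax, and the view names are placeholders):
<?php
class Search_controller extends CI_Controller {

    public function files()
    {
        // The original, pre-routing URI tells us which file was requested
        $requested = $this->uri->uri_string(); // e.g. "robots.txt"

        if ($requested === 'robots.txt') {
            header('Content-Type: text/plain');
            $this->load->view('files/robots');
        } elseif ($requested === 'LiveSearchSiteAuth.xml') {
            header('Content-Type: text/xml');
            $this->load->view('files/livesearch');
        } else {
            show_404();
        }
    }
}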
You should not need to escape the full stop; CodeIgniter does most of the escaping for you.
Here is a working example I use:
$route['news/rss/all.rss'] = "news/rss";
The issue was actually in my .htaccess file, where I had created a rewrite exception to allow the search engine files to be accessed directly rather than routing them through CodeIgniter.
RewriteCond $1 !^(index\.php|google421b29fc254592e0.html|LiveSearchSiteAuth.xml|content|robots\.txt|favicon.ico)
Became
RewriteCond $1 !^(index\.php|content|favicon.ico)