Exclude urls without 'www' from Nutch 1.7 crawl - regex

I'm currently using Nutch 1.7 to crawl my domain. My issue is specific to URLs being indexed as www vs. non-www.
Specifically, after running the crawl and indexing into Solr 4.5, then validating the results on the front end with AJAX Solr, the search results page lists results/pages under both 'www' and non-www URLs, such as:
www.mywebsite.com
mywebsite.com
www.mywebsite.com/page1.html
mywebsite.com/page1.html
My understanding is that the url filtering aka regex-urlfilter.txt needs modification. Are there any regex/nutch experts that could suggest a solution?
Here is the code on pastebin.

There are at least a couple of solutions.
1.) urlfilter-regex plugin
If you don't want to crawl the non-www pages at all (or want to filter them out at a later stage, such as index time), the urlfilter-regex plugin is what you need. It lets you mark URLs to be crawled by prefixing regex patterns with "+"; anything that does not match a "+" pattern will not be crawled. If you want to allow a general pattern but exclude certain URLs within it, you can additionally use a "-" prefix to specify the URLs to exclude.
In your case you would use a rule like:
+^(https?://)?www\.
This will match anything that starts with:
https://www.
http://www.
www.
and therefore will only allow such URLs to be crawled.
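A minimal regex-urlfilter.txt built around that rule might look like the following sketch (keep whatever other skip rules from the stock file you still need; rules are applied in order and the first match wins):
# skip file:, ftp:, and mailto: urls
-^(file|ftp|mailto):
# accept only hosts starting with www, with or without a scheme
+^(https?://)?www\.
# reject everything else
-.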
Since the URLs you listed were not being excluded despite your regex-urlfilter rules, either the plugin isn't enabled in your nutch-site.xml, or it isn't pointed at that file.
In nutch-site.xml you have to specify regex-urlfilter in the list of plugins, e.g.:
<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-basic|query-(basic|site|url)|response-(json|xml)|urlnormalizer-(pass|regex|basic)</value>
</property>
Additionally, check that the property specifying which file to use is not overridden in nutch-site.xml and is correct in nutch-default.xml. It should be:
<property>
  <name>urlfilter.regex.file</name>
  <value>regex-urlfilter.txt</value>
  <description>Name of file on CLASSPATH containing regular expressions
  used by urlfilter-regex (RegexURLFilter) plugin.</description>
</property>
and regex-urlfilter.txt should be in the conf directory for nutch.
There is also the option to perform the filtering only at certain steps, e.g. at index time, if that is the only point where you want to filter.
2.) solrdedup command
If the URLs point to the exact same page, which I am guessing is the case here, they can be removed by running the nutch command to delete duplicates after crawling:
http://wiki.apache.org/nutch/bin/nutch%20solrdedup
This will use the digest values computed from the text of each indexed page to find any pages that were the same and delete all but one.
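For example, assuming a local Solr instance (the URL below is a placeholder for your own core):
bin/nutch solrdedup http://localhost:8983/solr/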
However, you would have to modify that job to change which duplicate is kept if you want to specifically keep the "www" ones.
3.) Write a custom indexing filter plugin
You can write a plugin that reads the URL field of a Nutch document and converts it in any way you want before indexing. This gives you more flexibility than an existing plugin like urlnormalizer-regex.
It is actually very easy to write plugins and add them to Nutch, which is one of its great features. As a starting point, copy and study one of the plugins included with Nutch that implements IndexingFilter, such as the index-basic plugin; a rough sketch of such a filter follows the example links below.
You can also find a lot of examples:
http://wiki.apache.org/nutch/WritingPluginExample
http://sujitpal.blogspot.com/2009/07/nutch-custom-plugin-to-parse-and-add.html
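As an illustration only, here is a minimal sketch of such a filter that rewrites non-www URLs to their www form before indexing. The package and class names are made up, and the NutchDocument calls (removeField/add) and the filter signature are assumptions based on the Nutch 1.x plugin API; a real plugin also needs a plugin.xml descriptor and a build entry, which are not shown.

package org.example.indexer.wwwnormalize; // hypothetical package name

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.NutchDocument;
import org.apache.nutch.parse.Parse;

/** Sketch: force the indexed "url" field to the www form of the host. */
public class WwwUrlIndexingFilter implements IndexingFilter {

  private Configuration conf;

  @Override
  public NutchDocument filter(NutchDocument doc, Parse parse, Text url,
      CrawlDatum datum, Inlinks inlinks) throws IndexingException {
    String u = url.toString();
    // Naive rewrite: insert "www." right after the scheme when it is missing.
    String normalized = u.replaceFirst("^(https?://)(?!www\\.)", "$1www.");
    // Replace the "url" field with the normalized value
    // (field name and removeField/add are assumed from the 1.x NutchDocument API).
    doc.removeField("url");
    doc.add("url", normalized);
    return doc;
  }

  @Override
  public void setConf(Configuration conf) {
    this.conf = conf;
  }

  @Override
  public Configuration getConf() {
    return conf;
  }
}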

Related

How to Make Relative Href Work in SvelteKit?

I want to build a Web app with SvelteKit with one page listing all items (with potential search query parameters), and then one page for each individual item. If I had to build this the old-school way, with everything generated in the backend, my paths would be /items/ for the listing of items and /items/123 for item 123, etc. That is, to go to the page of item 123, a link with href="123" would work no matter whether you are currently at the index (/items/) or at the page of one particular item (/items/[id]).
With SvelteKit, if I create files routes/items/index.svelte and routes/items/[id].svelte, then routes/items/index.svelte will have path /items, without a trailing slash, and as a result a link with href="123" will lead to /123, resulting in a "not found" error.
This same link will, however, work from the page of an individual item, say /items/456.
This is radically different from what you would have in the traditional HTML model, where a link from /items/ (or /items/index.html) would work the same as a link from /items/[id].html.
Now in svelte.config.js there is a trailingSlash option you can set to always so that routes/items/index.svelte corresponds to path /items/, but then routes/items/[id].svelte has path /items/[id]/ and we have the same problem again: one href value cannot work from both the index and the page of an individual item.
The only way I see right now is to use absolute paths, but that is not very composable. My guess is that there is something I am doing wrong.
You're not missing anything - it's not currently possible in SvelteKit to have a trailing slash for some pages but not for others. There is an open GitHub issue you may be interested in that proposes adding additional trailingSlash options. This issue cites the exact problem you described:
The trailingSlash options introduced in #1404 don't make it straightforward to add trailing slashes to index pages but not to leaf pages. For example you might want to have /blog/ and /blog/some-post rather than /blog and /blog/some-post, since that allows the index page to contain relative links.
Until that feature is added, you'll need to use absolute paths.

Replacing part of ${url}s from a sitemap in JMeter

I have a JMeter test plan that goes to a site's sitemap.xml page, retrieves each URL on that page with an XPath Extractor, then passes ${url} to an HTTP Request sampler within a ForEach Controller to send the results for each page to a file. This works great, except I just realized that the links on this sitemap.xml page are hardcoded. This is a problem when I want to test https://staging-website.com, but the links in sitemap.xml are all www.website.com pages. It seems like there must be a way to replace 'www.website.com' in each ${url} with 'staging-website.com' with regex or something, but I haven't been able to figure out how. Any suggestions would be greatly appreciated.
Add a BeanShell PreProcessor to the HTTP Request sampler to manipulate the url variable:
String sUrl = vars.get("url");
// replace scheme + host so e.g. http://www.website.com/... becomes https://staging-website.com/...
String sNewUrl = sUrl.replaceFirst("https?://www\\.website\\.com", "https://staging-website.com");
log.info("sNewUrl: " + sNewUrl);
vars.put("url", sNewUrl);
You can also correlate the sitemap.xml response with a Regular Expression Extractor so that you extract only the URI portion of each entry rather than the full URL including the host name. Shouldn't you have that already, given that the HTTP Request sampler only asks for the path and not the host?
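For instance, assuming the sitemap lists absolute URLs under www.website.com, a Regular Expression Extractor configured roughly as follows captures only the path into ${url}, which the HTTP Request sampler can then combine with the staging host:
Reference Name: url
Regular Expression: https?://www\.website\.com(/[^<\s]*)
Template: $1$
Match No.: -1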
You can use the __strReplace() function, available via the JMeter Plugins project, like:
${__strReplace(${url},www.website.com,staging-website.com,)}
The easiest way to install JMeter Custom Functions (as well as any other plugins) is using JMeter Plugins Manager
I was able to replace the host within the string by putting
${__javaScript('${url}'.replace('www.website'\,'staging.website'))}
in the path input of the second http request sampler. The answers provided by Selva and Dimitri were more elegant, so if I have time in the future to come back to this I will give them another try. I really appreciate the help!

Alternative to <!--#include virtual="somefilename"-->

I have a website running on an old Apache server with SSI enabled. My host wants to move to a new server which has SSI disabled for security reasons.
I have a whole lot of pages with Google-friendly URLs which just have one line:
<!--#include virtual="Url_Including_Search_String"-->
What is the best alternative to the SSI to keep my google friendly search strings returning the specified search result?
I can achieve most of the results with rewrite rules in the .htaccess file; however, some search strings have a space in the keyword while the URL doesn't, and I can't handle that with a rewrite rule.
i.e. www.somedomain.com.au/SYDNEY.htm would have
<!--#include virtual="/search.php?keyword=SYDNEY&Submit=SEARCH"-->
However, the issue is that
www.somedomain.com.au/POTTSPOINT.htm would have
<!--#include virtual="/search.php?keyword=POTTS+POINT&Submit=SEARCH"-->
A rewrite rule cannot detect where a space should be in a suburb name, so I am hoping there is an alternative to <!--#include virtual=
I have looked at RewriteMap but don't think I can access the file I would need to put this in.
I would use Mod Rewrite to redirect any calls to non-existent files to your Search page.
For example:
http://example.com/SYDNEY redirects to
http://example.com/search.php?q=SYDNEY
(assuming there is not actually a /SYDNEY/ file at your server root.)
Then get rid of all those individual redirect pages.
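A sketch of the .htaccess rules for that approach, assuming the suburb pages all look like /NAME.htm and reusing the keyword/Submit parameters from the question:
RewriteEngine On
# only rewrite requests that do not match a real file or directory
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteRule ^([A-Za-z]+)\.htm$ /search.php?keyword=$1&Submit=SEARCH [L,QSA]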
As for the spaces, I'd modify my actual Search page to recognize (for example) "POTTSPOINT" and figure out that the space should be inserted. Basically compare the search term against a database of substitutions.

Get some or all (sub)page URLs of a website by a particular pattern

Suppose we have a website called http://www.example.com. I would like to get a list of its URI pages (just the URLs themselves, not URLs inside those URLs) - either all of them (including all subdomains and all subpages), or just some of them provided that they follow a particular globbing and/or regex pattern.
So, for example, I'm looking for something that gets all URLs (just the URL addresses themselves) that follow a pattern such as http://*.example.com/*. I'm aware that globbing in Linux (e.g. via the shell) is (mostly or fully?) limited to local files and directories (correct me if I'm wrong).
How can I achieve this?
I suppose that something related (although not quite the same?) is discussed here: How to find all links / pages on a website.
P.S. All of the URLs are part of a website that is made of static webpages only. I'm not sure if it's even possible to do the same thing with websites that are made of dynamic webpages... Also, I'm not sure if any URLs with query strings in them (e.g. http://www.example.com/?=abc&xyz) can be captured at all in such a way.

Is adding a robots.txt to my Django application the way to get listed by Google?

I have a website (Django) on a Linux server, but Google isn't finding the site at all. I know that I don't have a robots.txt file on the server. Can someone tell me how to create one, what to write inside and where to place it? That would be a great help!
robots.txt is not how Google finds your site. I think you need to register your site with Google and also add a sitemap.xml.
Webmaster Tools - Crawl URL ->
https://www.google.com/webmasters/tools/submit-url?continue=/addurl&pli=1
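A sitemap is just an XML list of the URLs you want crawled; a minimal example (with a placeholder URL) looks like this, and its address is what you submit in Webmaster Tools:
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
  </url>
</urlset>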
Also see this for robots.txt:
Three ways to add a robots.txt to your Django project | fredericiana
-> http://fredericiana.com/2010/06/09/three-ways-to-add-a-robots-txt-to-your-django-project/
What is robots.txt?
It is great when search engines frequently visit your site and index your content but often there are cases when indexing parts of your online content is not what you want. For instance, if you have two versions of a page (one for viewing in the browser and one for printing), you'd rather have the printing version excluded from crawling, otherwise you risk being imposed a duplicate content penalty. Also, if you happen to have sensitive data on your site that you do not want the world to see, you will also prefer that search engines do not index these pages (although in this case the only sure way for not indexing sensitive data is to keep it offline on a separate machine). Additionally, if you want to save some bandwidth by excluding images, stylesheets and javascript from indexing, you also need a way to tell spiders to keep away from these items.
One way to tell search engines which files and folders on your Web site to avoid is with the use of the Robots metatag. But since not all search engines read metatags, the Robots metatag can simply go unnoticed. A better way to inform search engines about your will is to use a robots.txt file.
from What is Robots.txt -> http://www.webconfs.com/what-is-robots-txt-article-12.php
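For completeness, a robots.txt that allows everything to be crawled (which is what you want if the goal is simply to get listed) can be as short as:
User-agent: *
Disallow:

# optional: point crawlers at your sitemap (placeholder URL)
Sitemap: http://www.example.com/sitemap.xml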
robots.txt files are used to tell search engines which content should or should not be indexed. A robots.txt file is in no way required in order to be indexed by a search engine.
There are a number of things to note about being indexed by search engines.
There is no guarantee you will ever be indexed
Indexing takes time, a month, two months, 6 months
To get indexed quicker try sharing a link to your site through blog comments etc to increase the chances of being found.
Submit your site through the http://google.com/webmasters site; this will also give you hints and tips to make your site better, as well as crawling stats.
Put robots.txt in the same directory as views.py and use this code.
In the view:
from django.http import HttpResponse
import os.path

def robots(request):
    # robots.txt sits next to this views.py file
    BASE = os.path.dirname(os.path.abspath(__file__))
    robots_file = open(os.path.join(BASE, 'robots.txt'))
    content = robots_file.read()
    robots_file.close()
    return HttpResponse(content, content_type="text/plain")
In urls.py:
(r'^robots\.txt$', 'aktel.views.robots'),