nutch regex, how to implement crawling strategy - regex

I am trying to establish the following crawling behaviour in a Nutch 1.8 environment via the regex-urlfilter.txt file:
First:
Crawl the start page (www.domainname.com) of the site defined in the seed.txt file.
Second:
Additionally, crawl only pages in two specific directories, "directoryname1" (www.domainname.com/directoryname1/...) and "directoryname2" (www.domainname.com/directoryname2/...), that are linked from the start page, and disregard everything else.
So far, the filters I tried were either too general, so the crawler crawled the start page and all other directories (not only directories 1 and 2), or too strict, so the crawler did not start at all (because the seed URL did not match the regex of the URL filter for the directories).
Thanks for your help, Chris

I solved it on my own. Here is my solution:
Regex for just the start page:
+^.*[.]de/$
Regex for directory 1:
+.*/directoryname1/.*
Regex for directory 2:
+.*/directoryname2/.*
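For reference, here is a minimal sketch of how these rules might sit together in regex-urlfilter.txt. It assumes the usual Nutch behaviour that rules are evaluated top-down with the first match winning, and that the default catch-all "+." line is removed so the final "-." rejects everything else:
# accept only the start page of the site from seed.txt
+^.*[.]de/$
# accept pages inside the two directories
+.*/directoryname1/.*
+.*/directoryname2/.*
# reject everything else
-.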

Related

Rewrite first folder to GET param for all php files

I am desperately looking for a rule to achieve the following:
Input URL request would be:
http://myserver.com/param/other/folders/and/files.php
It should redirect to
http://myserver.com/other/folders/and/files.php?p=param
similarly the basic index request
http://myserver.com/param/
would redirect to
http://myserver.com/?p=param
All my PHP files need the parameter, wherever they are. It would be nice if JS and CSS files were excluded, but I guess it doesn't really matter, since /file.css?p=param would just be ignored and not cause a problem. I have found rules to map a folder to a GET parameter, but none of them work for PHP files deeper than the index file on the root level. Thanks so much in advance!
Replace
http:\/\/([^\/]+)\/(\w+)\/(.*)
with
http:\/\/\1/\?p=\2\/\3
example regex page at https://regex101.com/r/sU6lR9/1
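If the setup behind this question is Apache with mod_rewrite, a rough .htaccess sketch along the same lines might look like the following. This is an assumption about the environment, not part of the answer above; the p parameter name comes from the question, and the rule performs an internal rewrite (add R=301 to the flags if an external redirect is wanted):
RewriteEngine On
# leave requests for files and directories that really exist (CSS, JS, images) alone
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
# move the first path segment into the ?p= query parameter and keep the rest of the path
RewriteRule ^([^/]+)/(.*)$ /$2?p=$1 [QSA,L]
With that rule, /param/other/folders/and/files.php is rewritten to /other/folders/and/files.php?p=param, and /param/ becomes /?p=param.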

Exclude urls without 'www' from Nutch 1.7 crawl

I'm currently using Nutch 1.7 to crawl my domain. My issue is specific to URLs being indexed as www vs. non-www.
Specifically, after running the crawl, indexing to Solr 4.5, and validating the results on the front end with AJAX Solr, the search results page lists results/pages with both 'www' and non-www URLs, such as:
www.mywebsite.com
mywebsite.com
www.mywebsite.com/page1.html
mywebsite.com/page1.html
My understanding is that the URL filtering, i.e. regex-urlfilter.txt, needs modification. Are there any regex/Nutch experts who could suggest a solution?
Here is the code on pastebin.
There are at least a couple solutions.
1.) urlfilter-regex plugin
If you don't want to crawl the non-www pages at all, or want to filter them out at a later stage such as index time, the urlfilter-regex plugin is what you need. It lets you mark URLs matching regex patterns prefixed with "+" as allowed to be crawled; anything that does not match a "+" pattern will not be crawled. Additionally, in case you want to specify a general pattern but exclude certain URLs, you can use a "-" prefix to specify URLs to subsequently exclude.
In your case you would use a rule like:
+^(https?://)?www\.
This will match anything that starts with:
https://www.
http://www.
www.
and therefore will only allow such URLs to be crawled.
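In regex-urlfilter.txt that could look roughly like this (a sketch; note that the stock file usually ends with a catch-all "+." line, which would have to be removed or replaced for the www-only rule to have any effect):
# accept only www URLs
+^(https?://)?www\.
# reject everything else
-.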
Since the URLs you listed were not being excluded despite your regex-urlfilter, it means either the plugin wasn't turned on in your nutch-site.xml, or it is not pointed at that file.
In nutch-site.xml you have to specify regex-urlfilter in the list of plugins, e.g.:
<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-basic|query-(basic|site|url)|response-(json|xml)|urlnormalizer-(pass|regex|basic)</value>
</property>
Additionally check that the property specifying which file to use is not over-written in nutch-site.xml and is correct in nutch-default.xml. It should be:
<property>
  <name>urlfilter.regex.file</name>
  <value>regex-urlfilter.txt</value>
  <description>Name of file on CLASSPATH containing regular expressions
  used by urlfilter-regex (RegexURLFilter) plugin.</description>
</property>
and regex-urlfilter.txt should be in the conf directory for nutch.
There is also the option to perform the filtering only at certain steps, e.g. index time, if you only want to filter then.
2.) solrdedup command
If the URLs point to the exact same page, which I am guessing is the case here, they can be removed by running the nutch command to delete duplicates after crawling:
http://wiki.apache.org/nutch/bin/nutch%20solrdedup
This will use the digest values computed from the text of each indexed page to find any pages that were the same and delete all but one.
However you would have to modify the plugin to change which duplicate is kept if you want to specifically keep the "www" ones.
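Usage is a single command after the crawl finishes, for example (assuming Solr is running at the default http://localhost:8983/solr; adjust the URL to your setup):
bin/nutch solrdedup http://localhost:8983/solr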
3.) Write a custom indexing filter plugin
You can write a plugin that reads the URL field of a Nutch document and converts it in any way you want before indexing. This would give you more flexibility than using an existing plugin like urlnormalizer-regex.
It is actually very easy to write plugins and add them to Nutch, which is one of the great things about it. As a starting point you can copy and study one of the other plugins included with Nutch that implement IndexingFilter, such as the index-basic plugin.
You can also find a lot of examples:
http://wiki.apache.org/nutch/WritingPluginExample
http://sujitpal.blogspot.com/2009/07/nutch-custom-plugin-to-parse-and-add.html

Nutch regex doesn't crawl the way I want it to

OK, I asked this already, but I guess I didn't ask it the way Stack Overflow expects. Hopefully I will have more luck this time and get an answer.
I am trying to run nutch to crawl this site: http://www.tigerdirect.com/
I want it to crawl that site and all sublinks.
The problem is it's not working. In my regex file I tried a couple of things, but none of them worked:
+^http://([a-z0-9]*\.)*tigerdirect.com/
+^http://tigerdirect.com/([a-z0-9]*\.)*
my urls.txt is:
http://tigerdirect.com
Basically what I am trying to accomplish is to crawl all the product pages on their website so I can create a search engine (I am using solr) of electronic products. Eventually I want to crawl bestbuy.com, newegg.com and other sites as well.
BTW, I followed the tutorial from here: http://wiki.apache.org/nutch/NutchTutorial and I am using the script mentioned in section 3.3 (after fixing a bug it had).
I have a background in Java, Android and Bash, so this is a little new to me. I used to do regex in Perl 5 years ago, but that is all forgotten.
Thanks!
According to your comments I see that you have crawled something before, and this is why your Nutch starts to crawl Wikipedia.
When you crawl something with Nutch, it records some metadata in a table (if you use HBase, it is a table named 'webpage'). When you finish a crawl and start a new one, that table is scanned, and if there is a record whose metadata says "this record can be fetched again because its next fetch time has passed", Nutch starts to fetch those URLs as well as your new URLs.
So if you want to have just http://www.tigerdirect.com/ crawled on your system, you have to clean up that table first. If you use HBase, start the shell:
./bin/hbase shell
and disable table:
disable 'webpage'
and finally drop it:
drop 'webpage'
I could have truncated that table, but I dropped it instead.
The next step is to put this into your seed.txt:
http://www.tigerdirect.com/
Open regex-urlfilter.txt, which is located at:
nutch/runtime/local/conf
and write this line into it:
+^http://([a-z0-9]*\.)*www.tigerdirect.com/([a-z0-9]*\.)*
Put that line in place of the default +. line.
I have set it up to also crawl subdomains of tigerdirect; whether you want that is up to you.
After that you can send it to Solr to index and search on it. I have tried it and it works, though you may get some errors on the Nutch side, but that is another topic.
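To recap the setup from this answer in one place (the filter line is the one suggested above; adjust it if you do not want subdomains):
# seed.txt
http://www.tigerdirect.com/
# regex-urlfilter.txt (replacing the default "+." line)
+^http://([a-z0-9]*\.)*www.tigerdirect.com/([a-z0-9]*\.)*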
You've got a / at the end of both of your regexes but your URL doesn't.
http://tigerdirect.com/ will match, http://tigerdirect.com will not.
+^http://tigerdirect.com/([a-z0-9]*\.)*
Try moving that trailing slash inside the parens:
+^http://tigerdirect.com(/[a-z0-9]*\.)*

Configuring LucidWorks Include Paths to only crawl certain file types

I'm trying to configure the LucidWorks web data source to only index certain file types. However, when I set Include paths to .*\.html to only crawl .html files (as a simplified example), it only ends up indexing the top level folder. Crawl depth is set to -1 and when I leave Include paths blank, it crawls the whole sub-tree as expected.
I've looked at their documentation for creating a web data source, and for Using Regular Expressions, and can't find a reason why .*\.html would not work, since .* should match any character.
As I was proofreading the question, I had an idea which was the correct solution. Posting it here for posterity.
The content being crawled is a file share, so it relies on the web server's directory listings, which were being filtered out because they don't have a .html extension. So simply adding .*/ to the Include paths fixed the problem.
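So the Include paths list ends up with two patterns, one for the directory listings and one for the target files:
.*/
.*\.html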

Is adding a robots.txt to my Django application the way to get listed by Google?

I have a website (Django) on a Linux server, but Google isn't finding the site at all. I know that I don't have a robots.txt file on the server. Can someone tell me how to create one, what to write inside it, and where to place it? That would be a great help!
robots.txt is not how Google finds your site. I think you must register your site with Google and also add a sitemap.xml.
Webmaster Tools - Crawl URL ->
https://www.google.com/webmasters/tools/submit-url?continue=/addurl&pli=1
Also see this for robots.txt:
Three ways to add a robots.txt to your Django project | fredericiana
-> http://fredericiana.com/2010/06/09/three-ways-to-add-a-robots-txt-to-your-django-project/
What is robots.txt?
It is great when search engines frequently visit your site and index your content but often there are cases when indexing parts of your online content is not what you want. For instance, if you have two versions of a page (one for viewing in the browser and one for printing), you'd rather have the printing version excluded from crawling, otherwise you risk being imposed a duplicate content penalty. Also, if you happen to have sensitive data on your site that you do not want the world to see, you will also prefer that search engines do not index these pages (although in this case the only sure way for not indexing sensitive data is to keep it offline on a separate machine). Additionally, if you want to save some bandwidth by excluding images, stylesheets and javascript from indexing, you also need a way to tell spiders to keep away from these items.
One way to tell search engines which files and folders on your Web site to avoid is with the use of the Robots metatag. But since not all search engines read metatags, the Robots metatag can simply go unnoticed. A better way to inform search engines about your will is to use a robots.txt file.
from What is Robots.txt -> http://www.webconfs.com/what-is-robots-txt-article-12.php
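For illustration, a minimal robots.txt could look like this (the disallowed paths and the sitemap URL are placeholders, not something from the question):
User-agent: *
Disallow: /print/
Disallow: /admin/
Sitemap: http://www.example.com/sitemap.xml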
robots.txt files are used to tell search engines which content should or should not be indexed. A robots.txt file is in no way required for your site to be indexed by a search engine.
There are a number of things to note about being indexed by search engines.
There is no guarantee you will ever be indexed
Indexing takes time, a month, two months, 6 months
To get indexed quicker, try sharing a link to your site through blog comments etc. to increase the chances of being found.
Submit your site through the http://google.com/webmasters site; this will also give you hints and tips to make your site better, as well as crawling stats.
The robots.txt file lives in the same directory as views.py, and the view looks like this.
In views.py:
from django.http import HttpResponse
import os.path

def robots(request):
    # robots.txt is kept next to this views.py file
    BASE = os.path.dirname(os.path.abspath(__file__))
    # read the file and serve it as plain text
    with open(os.path.join(BASE, 'robots.txt')) as robots_file:
        content = robots_file.read()
    return HttpResponse(content, content_type='text/plain')
In urls.py:
(r'^robots\.txt$', 'aktel.views.robots'),
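Once the URL pattern is in place, a quick sanity check is possible with Django's test client, e.g. from python manage.py shell (a sketch, assuming the view and pattern above):
from django.test import Client

# request /robots.txt through the URL conf and the robots view
c = Client()
resp = c.get('/robots.txt')
print(resp.status_code)   # expect 200
print(resp.content[:60])  # the beginning of the robots.txt content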