Nutch regex doesn't crawl the way I want it to - regex

OK, I asked this already, but I guess I didn't ask it the way Stack Overflow expects. Hopefully I will have more luck this time and get an answer.
I am trying to run nutch to crawl this site: http://www.tigerdirect.com/
I want it to crawl that site and all sublinks.
The problem is it's not working. In my regex file I tried a couple of things, but none of them worked:
+^http://([a-z0-9]*\.)*tigerdirect.com/
+^http://tigerdirect.com/([a-z0-9]*\.)*
my urls.txt is:
http://tigerdirect.com
Basically what I am trying to accomplish is to crawl all the product pages on their website so I can create a search engine (I am using solr) of electronic products. Eventually I want to crawl bestbuy.com, newegg.com and other sites as well.
BTW, I followed the tutorial from here: http://wiki.apache.org/nutch/NutchTutorial and I am using the script mentioned in section 3.3 (after fixing a bug it had).
I have a background in Java, Android, and bash, so this is a little new to me. I used to write regexes in Perl five years ago, but that is all forgotten.
Thanks!

From your comments I see that you have crawled something before, and that is why your Nutch starts crawling Wikipedia.
When you crawl something with Nutch, it records some metadata in a table (if you use HBase, it is a table named webpage). When you finish one crawl and start a new one, that table is scanned; if a record's metadata says "this record can be fetched again because its next fetch time has passed", Nutch starts fetching those URLs as well as your new ones.
So if you want only http://www.tigerdirect.com/ to be crawled on your system, you have to clean up that table first. If you use HBase, start the shell:
./bin/hbase shell
and disable table:
disable 'webpage'
and finally drop it:
drop 'webpage'
I could have truncated that table instead, but I simply dropped it.
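If you would rather keep the table and just empty it, the HBase shell also has a truncate command (it disables, drops, and re-creates the table in one step):
truncate 'webpage'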
The next thing is to put this into your seed.txt:
http://www.tigerdirect.com/
Then open regex-urlfilter.txt, which is located at:
nutch/runtime/local/conf
and write this line into it:
+^http://([a-z0-9]*\.)*www.tigerdirect.com/([a-z0-9]*\.)*
Put that line in place of the default +. rule (which accepts everything).
I have written the regex so that subdomains of tigerdirect.com are also crawled; whether you want that is up to you.
After that you can send the crawl data to Solr for indexing and run searches against it. I have tried it and it works; you may still hit some errors on the Nutch side, but that is another topic.

You've got a / at the end of both of your regexes, but your seed URL doesn't have one.
http://tigerdirect.com/ will match, http://tigerdirect.com will not.
+^http://tigerdirect.com/([a-z0-9]*\.)*
Try moving that trailing slash inside the parens:
+^http://tigerdirect.com(/[a-z0-9]*\.)*
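To see the difference concretely, here is a quick check with Python's re module (a sketch only; Nutch itself applies these patterns with Java's regex engine, and the leading + in regex-urlfilter.txt means "accept" and is not part of the regex):
import re

old_pattern = r"^http://tigerdirect.com/([a-z0-9]*\.)*"   # slash outside the parens
new_pattern = r"^http://tigerdirect.com(/[a-z0-9]*\.)*"   # slash inside the parens
seed_url = "http://tigerdirect.com"                       # note: no trailing slash

print(bool(re.match(old_pattern, seed_url)))  # False -- the mandatory '/' is missing from the URL
print(bool(re.match(new_pattern, seed_url)))  # True  -- the group is optional, so the bare domain matches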

Related

content empty when using scrapy

Thanks to everyone in advance.
I encountered a problem when using Scrapy on Python 2.7.
The webpage I tried to crawl is a discussion board for Chinese stock market.
When I tried to get the first number, "42177", just under the banner of this page (the number you see on that webpage may not be the number shown in the picture here, because it represents the number of times this article has been read and is updated in real time), I always get empty content. I am aware that this might be a dynamic-content issue, but I don't have a clue how to crawl it properly.
The code I used is:
item["read"] = info.xpath("div[#id='zwmbti']/div[#id='zwmbtilr']/span[#class='tc1']/text()").extract()
I think the XPath is set correctly, and I have checked the return value of this response; it indeed told me that there is nothing under that element. The result is: 'read': [u'<div id="zwmbtilr"></div>']
If it has something, there should be something between <div id="zwmbtilr"> and </div>.
Really appreciated if you guys share any thoughts on this!
I just opened your link in Firefox with NoScript enabled. There is nothing inside <div id='zwmbtilr'></div>. If I enable JavaScript, I can see the content you want. So, as you already knew, it is a dynamic-content issue.
Your first option is to try to identify the request generated by the JavaScript. If you can do that, you can send the same request from Scrapy. If you can't, the next option is usually to use some package with JavaScript/browser emulation or something like that, for example ScrapyJS or Scrapy + Selenium.
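If you go the browser-emulation route, here is a minimal Selenium sketch (not part of the original answer; the URL is a placeholder and you need Firefox plus a matching driver installed):
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()
try:
    driver.get("http://example.com/the-discussion-page")  # placeholder for the real board URL
    # 'zwmbtilr' is the div from the question; it is filled in by JavaScript,
    # so it only has content once the page has been rendered by a browser.
    read_count = driver.find_element(By.ID, "zwmbtilr").text
    print(read_count)
finally:
    driver.quit()
You may still need an explicit wait if the counter is loaded asynchronously after the initial render.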

Analytics Goal Funnel Regex doesn't recognize "example.html?p=2"

I have my goal funnel set up and this is the regex for one of the stages: ^/shop/(.*)
This will match pages such as /shop/collections/art.html but when I look at the goal funnel, it says people are dropping out by going to pages like /shop/collections/art.html?p=2. Notice the ?p=2 is the only difference here.
I tried to do it as ^/shop/((.|\?)*) but I'm not sure that's fixing it.
How do I fix this?
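As a pure regex check (outside Analytics), the original pattern already matches the query-string variant, because . matches ?, = and digits too; a quick Python test of the two paths from the question:
import re

pattern = r"^/shop/(.*)"
for path in ["/shop/collections/art.html", "/shop/collections/art.html?p=2"]:
    print(path, "->", bool(re.match(pattern, path)))
# Both lines print True, so the pattern itself is not what excludes ?p=2.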

Solr admin shows number of indexes (numDocs) to be greater than the number of files I processed

When I process 56 files with Solr it says 'numDocs: 74'. I have no clue as to why more indexes would exist than files processed, but one explanation I came up with is that the indexes of a couple of the processed files are too big, so they are split up into multiple indexes (I use rich content extraction on all processed files). It was just a thought, so I don't want to take it as true right off the bat. Can anyone give an alternative explanation or confirm this one?
I am using Django + Haystack + Solr.
Many thanks
Your terminology is unfortunately all incorrect, but the troubleshooting process should be simple enough. Solr comes with an admin console, usually at http://[localhost or domain]:8983/solr/. Go there, find your collection in the drop-down (I am assuming Solr 4) and run the default query in the Query screen. That should give you all your documents, and you can see what the extras are.
I suspect you may have some issues with your unique IDs and/or reindexing, but with such a small number of documents you can simply review what you are actually storing in Solr and figure out what is not correct.
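If you prefer a script to the admin UI, the same check can be done against Solr's select handler; a rough sketch using the requests library (host, port and core name are assumptions about your setup):
import requests  # third-party: pip install requests

SOLR_SELECT = "http://localhost:8983/solr/collection1/select"  # assumed core name
params = {"q": "*:*", "rows": 100, "wt": "json"}  # fetch everything as JSON

data = requests.get(SOLR_SELECT, params=params).json()
print("numFound:", data["response"]["numFound"])
for doc in data["response"]["docs"]:
    print(doc.get("id"))
Comparing the returned IDs against your 56 source files should show where the extra documents come from.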

nutch regex, how to implement crawling strategy

I am trying to establish the following crawling behaviour in a Nutch 1.8 environment via the regex-urlfilter.txt file:
First:
Crawl the start page (www.domainname.com) of the site defined in the seed.txt file.
Second:
Additionally, only crawl pages in the two specific directories "directoryname1" (www.domainname.com/directoryname1/...) and "directoryname2" (www.domainname.com/directoryname2/...) that are linked from the start page, and disregard everything else.
So far the filters I tried were either too general and the crawler crawled the start page and all other directories (not only directory 1 and 2), or were too strict, so that the crawler did not start at all (as the seed-URL did not match the regex of the urlfilter for the directory).
Thanks for your help chris
I solved it on my own. Here is my solution:
regex for just the start page
+^.*[.]de/$
regex for directory 1
+.*/directoryname1/.*
regex for directory 2
+.*/directoryname2/.*
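A quick way to sanity-check these filters outside Nutch is with Python's re module (a sketch only: Nutch evaluates the patterns with Java's regex engine, and www.domainname.de below is a stand-in for the real seed host):
import re

# The '+' prefix in regex-urlfilter.txt means "accept"; it is not part of the regex.
patterns = [r"^.*[.]de/$", r".*/directoryname1/.*", r".*/directoryname2/.*"]

urls = [
    "http://www.domainname.de/",                      # start page      -> accept
    "http://www.domainname.de/directoryname1/page1",  # directory 1     -> accept
    "http://www.domainname.de/directoryname2/page2",  # directory 2     -> accept
    "http://www.domainname.de/otherdirectory/page3",  # everything else -> reject
]

for url in urls:
    accepted = any(re.match(p, url) for p in patterns)
    print(url, "->", "accept" if accepted else "reject")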

Is adding a robots.txt to my Django application the way to get listed by Google?

I have a website (Django) on a Linux server, but Google isn't finding the site at all. I know that I don't have a robots.txt file on the server. Can someone tell me how to create one, what to write inside it, and where to place it? That would be a great help!
robots.txt is not what makes Google find your site. I think you must register your site with Google and also add a sitemap.xml.
Webmaster Tools - Crawl URL ->
https://www.google.com/webmasters/tools/submit-url?continue=/addurl&pli=1
Also see this for robots.txt:
Three ways to add a robots.txt to your Django project | fredericiana
-> http://fredericiana.com/2010/06/09/three-ways-to-add-a-robots-txt-to-your-django-project/
What is robots.txt?
It is great when search engines frequently visit your site and index your content but often there are cases when indexing parts of your online content is not what you want. For instance, if you have two versions of a page (one for viewing in the browser and one for printing), you'd rather have the printing version excluded from crawling, otherwise you risk being imposed a duplicate content penalty. Also, if you happen to have sensitive data on your site that you do not want the world to see, you will also prefer that search engines do not index these pages (although in this case the only sure way for not indexing sensitive data is to keep it offline on a separate machine). Additionally, if you want to save some bandwidth by excluding images, stylesheets and javascript from indexing, you also need a way to tell spiders to keep away from these items.
One way to tell search engines which files and folders on your Web site to avoid is with the use of the Robots metatag. But since not all search engines read metatags, the Robots metatag can simply go unnoticed. A better way to inform search engines of your wishes is to use a robots.txt file.
from What is Robots.txt -> http://www.webconfs.com/what-is-robots-txt-article-12.php
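As an illustration (the paths below are made up, not taken from the question), a simple robots.txt could look like this:
User-agent: *
Disallow: /print/
Disallow: /private/

Sitemap: http://www.example.com/sitemap.xml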
robots.txt files are used to tell search engines which content should or should not be indexed. A robots.txt file is in no way required in order to be indexed by a search engine.
There are a number of things to note about being indexed by search engines.
There is no guarantee you will ever be indexed
Indexing takes time, a month, two months, 6 months
To get indexed more quickly, try sharing a link to your site through blog comments etc. to increase the chances of being found.
Submit your site through the http://google.com/webmasters site; this will also give you hints and tips to make your site better, as well as crawling stats.
Put robots.txt in the same directory as views.py and use code like the following.
In the view:
from django.http import HttpResponse
import os.path

def robots(request):
    # Serve the robots.txt file that sits next to this views.py
    BASE = os.path.dirname(os.path.abspath(__file__))
    robots_file = open(os.path.join(BASE, 'robots.txt'))
    content = robots_file.read()  # read the contents before closing the file
    robots_file.close()
    return HttpResponse(content, content_type='text/plain')
In urls.py:
(r'^robots\.txt$', 'aktel.views.robots'),
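For completeness, another common approach is to serve the rules straight from urls.py without a separate view or file; a minimal sketch in newer-style Django routing (the rule content is illustrative):
from django.conf.urls import url
from django.http import HttpResponse

ROBOTS_TXT = "User-agent: *\nDisallow: /admin/\n"  # illustrative rules

urlpatterns = [
    url(r'^robots\.txt$',
        lambda request: HttpResponse(ROBOTS_TXT, content_type="text/plain")),
]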