regex and scrapy in web crawling

I'm doing some web crawling with Scrapy. Currently it fetches the start URL but does not crawl any further.
start_urls = ['https://cloud.cubecontentgovernance.com/retention/document_types.aspx']
allowed_domains = ['cubecontentgovernance.com']
rules = (
    Rule(LinkExtractor(allow=("document_type_retention.aspx?dtid=1054456",)),
         callback='parse_item', follow=True),
)
The link I want to extract, as shown in the developer tools, is: <a id="ctl00_body_ListView1_ctrl0_hyperNameLink" href="document_type_retention.aspx?dtid=1054456"> pricing </a>
the corresponding url is https://cloud.cubecontentgovernance.com/retention/document_type_retention.aspx?dtid=1054456
So what should the allow field be? Thanks a lot.

When I try to open the site of your start URL I get a login window.
Did you try printing response.body in a simple parse method for your start URL? I suspect your Scrapy instance gets the same login window, which does not contain the URL you want to extract with the LinkExtractor.
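Separately, note that LinkExtractor's allow patterns are regular expressions, so the ? in document_type_retention.aspx?dtid=1054456 acts as a quantifier rather than a literal character. A minimal sketch of the difference, using only Python's re module (the URL is taken from the question):

```python
import re

url = "document_type_retention.aspx?dtid=1054456"

# Unescaped: '?' makes the preceding 'x' optional, so the literal
# "aspx?dtid" in the URL cannot be matched as written.
raw = re.compile(r"document_type_retention.aspx?dtid=1054456")

# Escaped: re.escape() turns '?' (and '.') into literal characters.
escaped = re.compile(re.escape("document_type_retention.aspx?dtid=1054456"))

print(raw.search(url) is None)        # True: the unescaped pattern fails
print(escaped.search(url) is not None)  # True: the escaped pattern matches
```

So once the login issue is solved, an allow pattern with the ? escaped (e.g. via re.escape or a backslash) should be what you want.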

Related

Python: scrape multiple pages of a message board

I would like to scrape all the messages from the message pages of Yahoo Finance for a specific stock.
Here is an example page:
http://finance.yahoo.com/mb/AMD/
I'd like to be able to get all the messages there.
If I click on the "Messages" button on the above link I go to this link:
http://finance.yahoo.com/mb/forumview/?&v=m&bn=d56b9fc4-b0f1-3e88-b1f5-e1c40c0067e7
which has more than 10 pages.
How can I use Python code to scrape this data by just knowing the stock symbol "AMD"?
The basics:
import mechanize
from bs4 import BeautifulSoup

br = mechanize.Browser()  # `br` was left undefined; assuming a mechanize Browser
tickers = ['AMD', 'AAPL', 'GOOG']
for t in tickers:
    url = 'http://finance.yahoo.com/mb/' + t + '/'
    r = br.open(url)
    html = r.read()
    soup = BeautifulSoup(html)
    print(soup)
The content you want is located within particular HTML tags. Use soup.find_all to get what you want. To move between pages, use Selenium.
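As a minimal sketch of the find_all step (the HTML fragment and the "msg" class name here are made up for illustration; inspect the real page to find the actual tags and classes):

```python
from bs4 import BeautifulSoup

# A made-up fragment standing in for the fetched message-board HTML.
html = """
<div class="msg">first message</div>
<div class="msg">second message</div>
<div class="other">not a message</div>
"""

soup = BeautifulSoup(html, "html.parser")
messages = [div.get_text() for div in soup.find_all("div", class_="msg")]
print(messages)  # ['first message', 'second message']
```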

Escape a news listing page while crawling a news website with Scrapy

I have to crawl articles containing certain keywords from news websites, and I am using Scrapy for this task. To check whether a keyword exists on a page, I extract the content from the page and search for the keyword. But I run into a problem with listing pages, which just list news items with links to the article pages, e.g. http://www.thehindu.com/features/cinema/ . I want to skip such pages, but I cannot find a way to check whether a page is a listing page or not.
There are several ways to do this.
You can use a regular expression to filter out listing-page URLs in the spider's parse function:
import re

def parse(self, response):
    list_page_pat = re.compile("your pattern")
    for url in extract_urls:  # however you collect the candidate URLs
        if list_page_pat.match(url) is None:
            # not a listing page: continue processing this URL
            ...
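For instance, section listing pages like http://www.thehindu.com/features/cinema/ tend to end in a trailing slash with no article slug, while article URLs carry extra path segments. A hedged sketch of such a filter (the pattern encodes an assumption about this site's URL scheme and should be verified against real URLs):

```python
import re

# Assumed convention: listing pages end with a section path and a
# trailing slash, e.g. /features/cinema/ (verify on the actual site).
list_page_pat = re.compile(r"^https?://www\.thehindu\.com(/[\w-]+)+/$")

urls = [
    "http://www.thehindu.com/features/cinema/",  # listing page
    "http://www.thehindu.com/features/cinema/some-article/article1234.ece",
]

for url in urls:
    print(url, bool(list_page_pat.match(url)))
```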

Use anchored urls in django (anchored by the id of http objects)

My level in front-end development is pretty low.
Nevertheless, I want to implement the very widespread behaviour of having several parts anchored in one single page instead of several separate pages, and of referring to these parts in the URL.
So instead of having mysite.com/how_to_walk and mysite.com/how_to_run as two different pages and templates, I would like to have one page mysite.com/how_to_do_stuff and then depending on if you want to #walk or #run, refer to the html headers with the id field as suffixes of the url.
I don't really know how to do this with Django. I'd like to create only one URL dispatcher entry, which I guess will look like this:
url(r'^how_to_stuff/#(?P<partId>[-\w]*)', views.how_to, name='how_to')
...and then I have to create a simple view, but I have no idea how to refer to the id in the render() call.
I found the answer to my own question. The crucial point is that when the client (browser) requests an anchored URL such as mysite.com/how_to_do_stuff#run, it sends only the root URL mysite.com/how_to_do_stuff to the server and then applies the anchor locally. So you need:
A classic, simple url/view/template combination that loads the page mysite.com/how_to_do_stuff when the client asks for it.
A way to send the client to these anchored pages and reference them during development. I do this through another url/view pair that redirects the client to the right anchored URL.
Below is the result:
In urls.py:
...
url(r'^how_to_do_stuff/(?P<part_id>[-\w]+)', views.how_to_redirect, name='how_to'),
url(r'^how_to_do_stuff', views.how_to)
In views.py:
from django.http import HttpResponseRedirect
from django.shortcuts import render

def how_to_redirect(request, part_id):
    return HttpResponseRedirect("/how_to_do_stuff/#" + part_id)

def how_to(request):
    return render(request, "GWSite/how_to_do_stuff.html")
And then I refer to these in my templates through:
{% url "how_to" "run" %}
From the Django project website.
Take a look at how they send the num variable to views.py:
# URLconf
from django.conf.urls import url

urlpatterns = [
    url(r'^blog/$', 'blog.views.page'),
    url(r'^blog/page(?P<num>[0-9]+)/$', 'blog.views.page'),
]

# View (in blog/views.py)
def page(request, num="1"):
    # Output the appropriate page of blog entries, according to num.
    ...

How do I access my query when using Haystack/Elasticsearch?

I originally followed this tutorial (https://django-haystack.readthedocs.org/en/latest/tutorial.html), and have so far been able to highlight my query within my returned results. However, I want to highlight this same query when visiting the next page that I load with a separate template. Is there any way to save/access this query so that I can highlight the same results within this other template?
Whenever I try to include a statement like the one below, I get an error, which I think is because I'm not accessing the query properly.
{% highlight section.body with query html_tag "span" css_class "highlighted" %}
You have to send the information you used to highlight the results on the first page along to the next page. You can use request.session to store the data and read it back on the next page, or you can pass the query through the URL to the next page.
If you want to see how the SearchQuerySet is managed and how to customize that kind of thing, I recommend reading views.py, forms.py, and the elasticsearch_backend in the haystack folder at "/usr/local/lib/python2.7/dist-packages/haystack".
This is the URL for the Django session documentation: Django Session
This is the URL for the documentation on passing parameters through the URL: URL dispatcher
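A minimal sketch of both options (the key name 'search_query' and the view names are hypothetical, and the session lines assume Django's standard session middleware is enabled):

```python
from urllib.parse import urlencode

# Hypothetical Django views (illustrative, not from the question):
#
# def search(request):
#     query = request.GET.get('q', '')
#     request.session['search_query'] = query   # store for the next page
#     ...
#
# def next_page(request):
#     query = request.session.get('search_query', '')  # read it back
#     ...

# The alternative is to carry the query in the URL instead:
def next_page_url(base, query):
    """Build the next page's URL carrying the search query."""
    return base + '?' + urlencode({'q': query})

print(next_page_url('/search/page/2/', 'django haystack'))
# /search/page/2/?q=django+haystack
```

Either way, the template for the next page receives the query and can feed it to the same {% highlight %} tag.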

Want a general screen-scraping application (like Facebook has) for Django

The use case is:
The user enters a URL (as you might in the Facebook status-update box),
and a short description of this URL appears, with its title and a thumbnail.
(Yes, the basic process is "sharing a link".)
How would you go about doing it?
Call a Django app which scrapes URLs (if yes, then which one?)
Or do it using JavaScript? Is that possible?
If you want a thumbnail, have a look at python-webkit2png.
Have a look at BeautifulSoup, a very capable HTML parser. With it, you can scrape URLs for tags and display them. The same goes for page titles.
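As a minimal sketch of the title-extraction part using only the standard library (BeautifulSoup would make this shorter; the HTML string here is a made-up stand-in for a fetched page):

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collect the text inside the first <title> tag."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

html = "<html><head><title>Example page</title></head><body>...</body></html>"
parser = TitleParser()
parser.feed(html)
print(parser.title)  # Example page
```

The same handler pattern extends to meta-description tags; the thumbnail would come from python-webkit2png as suggested above.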