Can I force Scrapy to request a URL that includes commas without encoding them into %2C? The site (a Phorum forum) I want to crawl does not accept encoded URLs and redirects me to the root.
So, for example, I have this page to parse: example.phorum.com/read.php?12,8
The URL is being encoded into: example.phorum.com/read.php?12%2C8=
But every time I try to request this URL, I'm redirected to the page with the list of topics:
example.phorum.com/list.php?12
In these example URLs, 12 is the category number and 8 is the topic number.
I tried to disable redirects by disabling the RedirectMiddleware:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.redirect.RedirectMiddleware': None,
}
and, in the spider:
handle_httpstatus_list = [302, 403]
Moreover, I tried to rewrite this URL and request it from a sub-parser:
rules = [Rule(RegexLinkExtractor(allow=[r'(.*%2C.*)']), follow=True, callback='prepare_url')]

def prepare_url(self, response):
    url = response.url
    url = re.sub(r'%2C', ',', url)
    if "=" in url[-1]:
        url = url[:-1]
    yield Request(urllib.unquote(url), callback=self.parse_site)
where parse_site is the target parser, which is still being called with the encoded URL.
Thanks in advance for any feedback.
You can try canonicalize=False. Example IPython session:
In [1]: import scrapy
In [2]: from scrapy.contrib.linkextractors.regex import RegexLinkExtractor
In [3]: hr = scrapy.http.HtmlResponse(url="http://example.phorum.com", body="""<a href="http://example.phorum.com/list.php?1,2">link</a>""")
In [4]: lx = RegexLinkExtractor(canonicalize=False)
In [5]: lx.extract_links(hr)
Out[5]: [Link(url='http://example.phorum.com/list.php?1,2', text=u'link', fragment='', nofollow=False)]
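If that works for the link extraction, here is a minimal sketch of wiring canonicalize=False into a crawl rule, assuming the same scrapy.contrib imports as above; the spider name, start URL, and allow pattern are illustrative placeholders, not taken from your project:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.regex import RegexLinkExtractor

class PhorumSpider(CrawlSpider):
    name = 'phorum'
    start_urls = ['http://example.phorum.com/']

    # canonicalize=False keeps the extracted URLs as-is, so the comma in
    # read.php?12,8 is not rewritten to %2C before the request is made
    rules = [
        Rule(RegexLinkExtractor(allow=[r'read\.php\?\d+,\d+'], canonicalize=False),
             follow=True, callback='parse_site'),
    ]

    def parse_site(self, response):
        self.log("Visited %s" % response.url)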
My web app is deployed using nginx. I have a view like the one below for the URL /incoming/.
def incoming_view(request):
    incoming = request.GET["incoming"]
    user = request.GET["user"]
    ...
When I hit the URL /incoming/?incoming=hello&user=nkishore I get the response I need, but when I call this URL using the requests module with the code below I get an error.
r = requests.get('http://localhost/incoming/?incoming=%s&user=%s'%("hello", "nkishore"))
print r.json()
I have checked the nginx logs, and the request I got was /incoming/?incoming=hi\u0026user=nkishore, so in my view request.GET["user"] is failing to get the user.
I am not sure what I am missing here. Is this a problem with nginx, or is there another way to make this call with requests?
See the Requests docs for how to pass parameters, e.g.
>>> payload = {'key1': 'value1', 'key2': 'value2'}
>>> r = requests.get('https://httpbin.org/get', params=payload)
>>> print(r.url)
https://httpbin.org/get?key2=value2&key1=value1
Internally, Requests will likely escape the & ampersand for you. If you really want to build the URL manually, try this as your URL string:
'http://localhost/incoming/?incoming=%s&user=%s'
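For comparison, here is a small sketch of the same call with params, using the values from the question, so that Requests handles the encoding itself:

import requests

# Let Requests build and encode the query string for you
r = requests.get('http://localhost/incoming/',
                 params={'incoming': 'hello', 'user': 'nkishore'})
print(r.url)     # http://localhost/incoming/?incoming=hello&user=nkishore
print(r.json())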
I'm using the following to get all external JavaScript references from a web page. How can I modify the code to search not only this URL, but all pages of the website?
import httplib2
from BeautifulSoup import BeautifulSoup, SoupStrainer

http = httplib2.Http()
status, response = http.request('https://stackoverflow.com')

for link in BeautifulSoup(response, parseOnlyThese=SoupStrainer('script')):
    if link.has_key('src'):
        if 'http' in link['src']:
            print link['src']
Below is my first attempt at making it scrape two pages deep. Any advice on how to make it return only unique URLs? As is, most are duplicates. (Note that all internal links contain the word "index" on the sites I need to run this on.)
import httplib2
from BeautifulSoup import BeautifulSoup, SoupStrainer

site = 'http://www.stackoverflow.com/'
http = httplib2.Http()
status, response = http.request(site)

for link in BeautifulSoup(response, parseOnlyThese=SoupStrainer('a')):
    if link.has_key('href'):
        if 'index' in link['href']:
            page = site + link['href']
            status, response = http.request(page)
            for link in BeautifulSoup(response, parseOnlyThese=SoupStrainer('script')):
                if link.has_key('src'):
                    if 'http' in link['src']:
                        print "script" + " " + link['src']
            for iframe in BeautifulSoup(response, parseOnlyThese=SoupStrainer('iframe')):
                print "iframe" + " " + iframe['src']
            for link in BeautifulSoup(response, parseOnlyThese=SoupStrainer('a')):
                if link.has_key('href'):
                    if 'index' in link['href']:
                        page = site + link['href']
                        status, response = http.request(page)
                        for link in BeautifulSoup(response, parseOnlyThese=SoupStrainer('script')):
                            if link.has_key('src'):
                                if 'http' in link['src']:
                                    print "script" + " " + link['src']
                        for iframe in BeautifulSoup(response, parseOnlyThese=SoupStrainer('iframe')):
                            print "iframe" + " " + iframe['src']
Crawling websites is a vast subject: you have to decide how to index content and how to crawl deeper into the website, which includes the kind of content parsing your rudimentary crawler or spider is doing. It is definitely non-trivial to write a bot comparable in quality to Googlebot. Professional crawling bots do a lot of work, which may include:
Monitoring domain-related changes to trigger a crawl
Scheduling sitemap lookups
Fetching web content (which is the scope of this question)
Fetching the set of links for further crawling
Adding weights or priorities to each URL
Monitoring when the website's services go down
For just crawling a specific website like Stack Overflow, I have modified your code for recursive crawling. It would be trivial to convert this code further into a multi-threaded form. It uses a Bloom filter to make sure it does not crawl the same page again. Let me warn you upfront: there will still be unexpected pitfalls while crawling. Mature crawling software like Scrapy, Nutch, or Heritrix does a much better job.
import requests
from bs4 import BeautifulSoup as Soup, SoupStrainer
from bs4.element import Tag
from bloom_filter import BloomFilter
from Queue import Queue
from urlparse import urljoin, urlparse

# Probabilistic record of URLs that have already been fetched
visited = BloomFilter(max_elements=100000, error_rate=0.1)
# FIFO queue of URLs still to fetch
visitlist = Queue()


def isurlabsolute(url):
    return bool(urlparse(url).netloc)


def visit(url):
    print "Visiting %s" % url
    visited.add(url)
    return requests.get(url)


def parsehref(response):
    if response.status_code == 200:
        for link in Soup(response.content, 'lxml', parse_only=SoupStrainer('a')):
            if type(link) == Tag and link.has_attr('href'):
                href = link['href']
                # Resolve relative links against the page that contained them
                if not isurlabsolute(href):
                    href = urljoin(response.url, href)
                href = str(href)
                if href not in visited:
                    visitlist.put_nowait(href)
                else:
                    print "Already visited %s" % href
    else:
        print "Got issues mate"


if __name__ == '__main__':
    visitlist.put_nowait('http://www.stackoverflow.com/')
    while not visitlist.empty():
        url = visitlist.get()
        resp = visit(url)
        parsehref(resp)
        visitlist.task_done()
    visitlist.join()
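One note on the design choice: a Bloom filter is compact but probabilistic, so an occasional false positive can make the crawler skip a page it never actually visited. For a single site with a modest number of URLs, a plain Python set of seen URLs would guarantee uniqueness at the cost of more memory; the Bloom filter pays off when the URL frontier gets very large.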
When I use urllib, urllib2, or requests on Python 2.7, none of them ends up at the same URL as I do when I copy and paste the starting URL into Chrome or Firefox on a Mac.
EDIT: I suspect this is because one has to be signed in to vk.com to be redirected. If this is the case, how do I add the sign-in to my script? Thanks!
Starting URL: https://oauth.vk.com/authorize?client_id=PRIVATE&redirect_uri=https://oauth.vk.com/blank.html&scope=friends&response_type=token&v=5.68
Actual final (redirected) URL: https://oauth.vk.com/blank.html#access_token=PRIVATE_TOKEN&expires_in=86400&user_id=PRIVATE
PRIVATE, PRIVATE_TOKEN = censored information
The following is one of several attempts at this:
import requests
APPID = 'PRIVATE'
DISPLAY_OPTION = 'popup' # or 'window' or 'mobile' or 'popup'
REDIRECT_URL = 'https://oauth.vk.com/blank.html'
SCOPE = 'friends' # https://vk.com/dev/permissions
RESPONSE_TYPE = 'token' # Documentation is vague on this. I don't know what
# other options there are, but given the context, i.e. that we want an
# "access token", I suppose this is the correct input
URL = 'https://oauth.vk.com/authorize?client_id=' + APPID + \
'&display='+ DISPLAY_OPTION + \
'&redirect_uri=' + REDIRECT_URL + \
'&scope=' + SCOPE + \
'&response_type=' + RESPONSE_TYPE + \
'&v=5.68'
# with requests
REQUEST = requests.get(URL)
RESPONSE_URL = REQUEST.url
I hope you notice whatever it is that's wrong with my code.
Extra info: I need the redirect because the PRIVATE_TOKEN value is necessary for further programming.
I tried some debugging, but neither the interpreter nor IPython prints out the debugging info.
Thanks!
The problem is the result of not being signed in within the Python environment.
Solution:
Use twill to create a browser in Python and sign in.
Code:
from twill.commands import *
BROWSER = get_browser()
BROWSER.go(URL) # URL is the URL concatenated in the question
RESPONSE_URL = BROWSER.get_url()
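If you need to actually perform the sign-in first, a rough sketch with twill's form commands might look like this; the form number and field names ('email', 'pass') are hypothetical, so run showforms() against the real login page to find the right ones:

from twill.commands import go, showforms, fv, submit, get_browser

go(URL)                           # URL is the authorize URL concatenated in the question
showforms()                       # inspect the login form VK serves when you are not signed in
fv("1", "email", "your_login")    # hypothetical form number and field names
fv("1", "pass", "your_password")
submit()

BROWSER = get_browser()
RESPONSE_URL = BROWSER.get_url()  # should now be the blank.html#access_token=... redirect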
I wanted to query the Freebase API to get the list of teams José Mourinho has played for.
So, the URL I used in my browser is
https://www.googleapis.com/freebase/v1/mqlread?query=[{"name": "José Mourinho","/sports/pro_athlete/teams": [{"mid": null,"team": null,"to": null,"optional": true}]}]
However,
import json
import urllib
service_url="https://www.googleapis.com/freebase/v1/mqlread"
query = '[{"name": "' + "José Mourinho" + '","/sports/pro_athlete/teams": [{"mid": null,"team": null,"to": null,"optional": true}]}]'
url = service_url + '?' + 'query='+query
response = json.loads(urllib.urlopen(url).read())
Gives me an error saying,
UnicodeError: URL u'https://www.googleapis.com/freebase/v1/mqlread?query=[{"name": "Jos\xe9 Mourinho","/sports/pro_athlete/teams": [{"mid": null,"team": null,"to": null,"optional": true}]}]' contains non-ASCII characters
What is the solution to this?
I think you skipped over a little bit of the docs. Try this instead:
# coding=UTF-8
import json
import urllib

service_url = "https://www.googleapis.com/freebase/v1/mqlread"
query = [{
    '/sports/pro_athlete/teams': [
        {
            'to': None,
            'optional': True,
            'mid': None,
            'team': None
        }
    ],
    'name': 'José Mourinho'
}]
url = service_url + '?' + urllib.urlencode({'query': json.dumps(query)})
response = json.loads(urllib.urlopen(url).read())
print response
Rather than building the query string yourself, use json.dumps and urllib.urlencode to create it for you. They're good at this.
Note: if you can use the requests package, that last bit could be:
import requests
response = requests.get(service_url, params={'query': json.dumps(query)})
Then you get to skip the URL construction and escaping altogether!
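And since Requests also decodes JSON for you, reading the result is one more line (a small sketch, assuming the endpoint answers with JSON as above):

data = response.json()   # parsed result; no manual json.loads needed
print data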
I am using Python 3.3 and the requests module, and I am trying to understand how to retrieve cookies from a response. The requests documentation says:
url = 'http://example.com/some/cookie/setting/url'
r = requests.get(url)
r.cookies['example_cookie_name']
That doesn't make sense; how do you get data from a cookie if you don't already know the name of the cookie? Maybe I don't understand how cookies work? If I try to print the response cookies I get:
<<class 'requests.cookies.RequestsCookieJar'>[]>
Thanks
You can retrieve them iteratively:
import requests

r = requests.get('http://example.com/some/cookie/setting/url')
for c in r.cookies:
    print(c.name, c.value)
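If a plain dict is more convenient, requests can also convert the cookie jar for you (a small sketch; which names appear depends entirely on what the server sets):

import requests

r = requests.get('http://example.com/some/cookie/setting/url')
cookies = requests.utils.dict_from_cookiejar(r.cookies)
print(cookies)   # {} if the server set no cookies, otherwise {name: value, ...}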
I got the following code from HERE:
from urllib2 import Request, build_opener, HTTPCookieProcessor, HTTPHandler
import cookielib

# Create a CookieJar object to hold the cookies
cj = cookielib.CookieJar()
# Create an opener to open pages using the http protocol and to process cookies.
opener = build_opener(HTTPCookieProcessor(cj), HTTPHandler())

# Create a request object to be used to get the page.
req = Request("http://www.about.com")
f = opener.open(req)

# See the first few lines of the page
html = f.read()
print html[:50]

# Check out the cookies
print "the cookies are: "
for cookie in cj:
    print cookie
See if this works for you.
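If the cookie jar still comes back empty, you can also dump the raw response headers from the same code and inspect them directly (a small sketch; whether a cookie appears depends on the site):

# f is the response object returned by opener.open(req) above
print f.info()                          # dump all response headers
print f.info().getheader('Set-Cookie')  # None if the server set no cookie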
Cookies are stored in headers as well, so in that output look for a line of the form:
"Set-Cookie: Name=Value; [Expires=Date; Max-Age=Value; Path=Value]"