Python requests module - Getting response cookies

I am using Python 3.3 and the requests module, and I am trying to understand how to retrieve cookies from a response. The requests documentation says:
url = 'http://example.com/some/cookie/setting/url'
r = requests.get(url)
r.cookies['example_cookie_name']
That doesn't make sense to me: how do you get data from a cookie if you don't already know the name of the cookie? Maybe I don't understand how cookies work? If I try to print the response cookies I get:
<<class 'requests.cookies.RequestsCookieJar'>[]>
Thanks

You can retrieve them iteratively:
import requests
r = requests.get('http://example.com/some/cookie/setting/url')
for c in r.cookies:
    print(c.name, c.value)
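If you just want a plain dict mapping names to values, requests also ships a helper for that. A minimal sketch:
import requests
r = requests.get('http://example.com/some/cookie/setting/url')
# Convert the cookie jar into a plain {name: value} dict
cookies = requests.utils.dict_from_cookiejar(r.cookies)
print(cookies)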

I got the following code from HERE:
from urllib2 import Request, build_opener, HTTPCookieProcessor, HTTPHandler
import cookielib

# Create a CookieJar object to hold the cookies
cj = cookielib.CookieJar()
# Create an opener to open pages using the http protocol and to process cookies.
opener = build_opener(HTTPCookieProcessor(cj), HTTPHandler())
# Create a request object to be used to get the page.
req = Request("http://www.about.com")
f = opener.open(req)
# See the first few lines of the page
html = f.read()
print html[:50]
# Check out the cookies
print "the cookies are: "
for cookie in cj:
    print cookie
See if this works for you.
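Note that urllib2 and cookielib are Python 2 modules; since you are on Python 3.3, here is a rough equivalent using the renamed standard-library modules (a sketch under that assumption):
from urllib.request import Request, build_opener, HTTPCookieProcessor, HTTPHandler
from http.cookiejar import CookieJar

cj = CookieJar()
opener = build_opener(HTTPCookieProcessor(cj), HTTPHandler())
f = opener.open(Request("http://www.about.com"))
print(f.read()[:50])
for cookie in cj:
    print(cookie)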

Cookies are delivered in response headers as well. If this isn't working for you, check the response headers for something like:
"Set-Cookie: Name=Value; [Expires=Date; Max-Age=Value; Path=Value]"

Related

How to make a request to get minimum data from server?

I want to make an HTTP request so that I get minimal data from the server. For example, if the user's device is a mobile, the server will send less data.
I was doing this in Python:
req = urllib2.Request(__url, headers={'User-Agent' : "Magic Browser"})
html = urllib2.urlopen(req).read()
But it still takes some time to download everything.
If it helps this is the domain from which I want to download pages : https://in.bookmyshow.com
Is there any other way so that I can download a page, quickly with minimum data? Is it even possible?
You can use requests to upload files and fetch data. For example, to get the cookies:
import requests
r = requests.get('https://in.bookmyshow.com')
print(r.cookies.get_dict())
or to upload a file:
import requests
# The tuple is (filename, file object, content type)
file = {'file': ('filename.txt', open('filename.txt', 'rb'), 'text/plain')}
data = {
    "ButtonValueNameInHtml": "Submit",
}
r = requests.post('https://in.bookmyshow.com', files=file, data=data)
Replace in.bookmyshow.com with your own URL. You can do many things with requests.
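As for actually receiving less data: sites that serve device-specific pages usually key off the User-Agent header, so sending a mobile one may get you a lighter page. A sketch (the exact header string is just an example, and whether this particular site honours it is an assumption):
import requests
# Pretend to be a mobile browser; the server may respond with a smaller page
headers = {'User-Agent': 'Mozilla/5.0 (Linux; Android 4.4) AppleWebKit/537.36 Mobile Safari/537.36'}
r = requests.get('https://in.bookmyshow.com', headers=headers)
print(len(r.content))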

Why does this request return an empty string?

I have the following function, which makes a GET request to a URL.
import requests

def fetch_data(session=None):
    s = session or requests.Session()
    url = 'http://www.moldelectrica.md/utils/load4.php'
    response = s.get(url)
    print(response.status_code)
    data = response.text
    return data
I expect to get back a string of the form:
169,26,0,19,36,151,9,647,26,15,0,0,0,0,0,150,7,27,-11,-27,-101,-19,-32,-78,-58,0,962,866,96,0,50.02
But instead I get an empty unicode string. The status code returned is 200.
I've looked at the request headers, but nothing in them suggests that any headers need to be set manually. Cookies are used, but I think the session object should handle that.
Figured it out. As I said, this URL provides data for a display, so it wouldn't normally be visited directly. Usually it would be requested by the display page, and that page would provide a cookie.
So the solution is to make a request to the display URL first, then reuse the session and make another request to the data URL.
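In code, that fix might look like this (the display-page URL here is an assumption; use whatever page normally embeds the display):
import requests
s = requests.Session()
# Visit the display page first so the session picks up its cookie
s.get('http://www.moldelectrica.md/')  # assumed display-page URL
# The data URL should now return the expected payload
print(fetch_data(session=s))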

Test scrapy spider still working - find page changes

How can I test a scrapy spider against online data?
I know from this post that it is possible to test a spider against offline data.
My target is to check whether my spider still extracts the right data from a page, or whether the page has changed. I extract the data via XPath, and sometimes the page receives an update and my scraper no longer works. I would love to have the test as close to my code as possible, e.g. using the spider and scrapy setup and just hooking into the parse method.
Referring to the link you provided, you could try this method for online testing, which I used for a problem similar to yours. Instead of reading the requests from a file, you can use the Requests library to fetch the live webpage and compose a scrapy response from the response you get, like below:
import requests
from scrapy.http import Request, TextResponse

def online_response_from_url(url=None):
    if not url:
        url = 'http://www.example.com'
    request = Request(url=url)
    # Fetch the live page with requests, then wrap it in a scrapy response
    oresp = requests.get(url)
    response = TextResponse(url=url, request=request,
                            body=oresp.text, encoding='utf-8')
    return response
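A hypothetical way to hook this into a test, assuming a spider class and import path of your own (both names here are placeholders):
from myproject.spiders.example import ExampleSpider  # hypothetical import path

spider = ExampleSpider()
response = online_response_from_url('http://www.example.com')
# Run the live response through the real parse method and check the output
items = list(spider.parse(response))
assert items, "spider extracted nothing - the page layout may have changed"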

Scrapy can't follow url with commas without encoding it

Can I force scrapy to request a URL including commas without encoding them into %2C? The site (a phorum) I want to crawl does not accept encoded URLs and redirects me to the root.
So, for example, I have this site to parse: example.phorum.com/read.php?12,8
The URL gets encoded into: example.phorum.com/read.php?12%2C8=
But every time I try to request this URL, I'm redirected to the page with the list of topics:
example.phorum.com/list.php?12
In those example URLs, 12 is the category number and 8 is the topic number.
I tried to disable redirecting by disabling RedirectMiddleware:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.redirect.RedirectMiddleware': None,
}
and in spider:
handle_httpstatus_list = [302, 403]
Moreover, I tried to rewrite the URL and request it via a sub-parser:
rules = [Rule(RegexLinkExtractor(allow=[r'(.*%2C.*)']), follow=True, callback='prepare_url')]

def prepare_url(self, response):
    url = re.sub(r'%2C', ',', response.url)
    if url.endswith("="):
        url = url[:-1]
    yield Request(urllib.unquote(url), callback=self.parse_site)
where parse_site is the target parser, which is still called with the encoded URL.
Thanks in advance for any feedback
You can try canonicalize=False. Example IPython session:
In [1]: import scrapy
In [2]: from scrapy.contrib.linkextractors.regex import RegexLinkExtractor
In [3]: hr = scrapy.http.HtmlResponse(url="http://example.phorum.com", body="""<a href="http://example.phorum.com/list.php?1,2">link</a>""")
In [4]: lx = RegexLinkExtractor(canonicalize=False)
In [5]: lx.extract_links(hr)
Out[5]: [Link(url='http://example.phorum.com/list.php?1,2', text=u'link', fragment='', nofollow=False)]
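Applied to your spider, that means passing canonicalize=False to the extractor in your rule, roughly like this (the class name and URL details are placeholders):
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.regex import RegexLinkExtractor

class PhorumSpider(CrawlSpider):  # hypothetical spider
    name = 'phorum'
    start_urls = ['http://example.phorum.com/list.php?12']
    # canonicalize=False keeps the comma instead of rewriting it to %2C
    rules = [Rule(RegexLinkExtractor(allow=[r'read\.php\?\d+,\d+'], canonicalize=False),
                  follow=True, callback='parse_site')]

    def parse_site(self, response):
        pass  # your target parser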

Getting HTTP 403 when pulling data from gdata api within a Django view

When trying to pull data from the YouTube gdata API using urllib2.urlopen, I receive an HTTP 403 error. I've turned off the CSRF middleware for debugging purposes, and the view I'm using looks like this:
def videos(request):
    params = {}
    youtube_search_url = 'http://gdata.youtube.com/feeds/api/videos'
    params['order_by'] = 'relevance'
    params['max_results'] = 10
    params['safeSearch'] = 'strict'
    params['v'] = 2
    params['key'] = '<developer key>'
    # params must be URL-encoded before being passed to urlopen
    encoded_params = urllib.urlencode(params)
    f = urllib2.urlopen(youtube_search_url, encoded_params)
    ...
Any ideas?
When you make an API request, use the X-GData-Key request header to specify your developer key as shown in the following example:
X-GData-Key: key=<developer_key>
Include the key query parameter in the request URL.
http://gdata.youtube.com/feeds/api/videos?q=SEARCH_TERM&key=DEVELOPER_KEY
^^ Straight from the horse's mouth. You are missing the X-GData-Key request header.
The key seems to be required in both the URL and the header, so given your previous code, try this:
req = urllib2.Request(youtube_search_url, encoded_params, { "X-GData-Key": 'key=<developer key>' })
f = urllib2.urlopen(req)
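For completeness: passing encoded_params as the second argument makes urllib2 send a POST, while the gdata feed is normally fetched with GET. A sketch that puts the params in the query string and the key in the header (reusing the names from the view above; <developer key> stays your own key):
req = urllib2.Request(youtube_search_url + '?' + urllib.urlencode(params),
                      headers={'X-GData-Key': 'key=<developer key>'})
f = urllib2.urlopen(req)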