Test scrapy spider still working - find page changes - unit-testing

How can I test a Scrapy spider against online data?
I know from this post that it is possible to test a spider against offline data.
My goal is to check whether my spider still extracts the right data from a page, or whether the page has changed. I extract the data via XPath, and sometimes the page receives an update and my scraper no longer works. I would love to have the test as close to my code as possible, e.g. using the spider and the Scrapy setup and just hooking into the parse method.

Referring to the link you provided, you could try this method for online testing, which I used for a problem similar to yours. Instead of reading the request from a file, you can use the Requests library to fetch the live web page and compose a Scrapy response from the result, like below:
import requests
from scrapy.http import TextResponse, Request

def online_response_from_url(url=None):
    if not url:
        url = 'http://www.example.com'
    # Fetch the live page with Requests, then wrap it in a Scrapy TextResponse
    request = Request(url=url)
    oresp = requests.get(url)
    response = TextResponse(url=url, request=request,
                            body=oresp.text, encoding='utf-8')
    return response
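Since you want to hook into the parse method, you can feed this response straight into your spider inside a unit test. A minimal sketch, assuming a hypothetical spider class MySpider whose parse method yields dicts with a 'title' key (both names are illustrative):

import unittest

class TestSpiderAgainstLivePage(unittest.TestCase):
    def test_parse_still_extracts_title(self):
        spider = MySpider()  # hypothetical spider class
        response = online_response_from_url('http://www.example.com')
        items = list(spider.parse(response))
        # If the page layout changed, the XPath misses and this fails
        self.assertTrue(items, "parse() returned no items")
        self.assertIn('title', items[0])

if __name__ == '__main__':
    unittest.main()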

Related

Webhook response with Django Channels' AsyncHttpConsumer on Heroku

I am trying to migrate my code from WSGI to ASGI/Channels' AsyncHttpConsumer. I can get an HTTP request from Dialogflow, and then I can use either send() or send_response() for the response.
I can do something like
await self.send_response(200, b'response text',
                         headers=[(b"Content-Type", b"text/plain")])
and my Heroku server sends it out normally, but Dialogflow does not get anything back.
I have another WSGI application that just uses
from django.http import JsonResponse
...
fulfillmentText = {'fulfillmentText': "server works correctly"}
return JsonResponse(fulfillmentText, safe=False)
and this actually returns to Dialogflow correctly.
I tried to use JsonResponse on the ASGI/Channels side, but it gives me an error basically saying I'm not using send_response correctly.
What do I need to do to convert my response correctly on the AsyncHttpConsumer side?
Figured it out.
I just needed to convert my dict to bytes:
response = json.dumps(mydict).encode('utf-8')
await self.send_response(200, response,
                         headers=[(b"Content-Type", b"text/plain")])
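Putting it together, a minimal sketch of what the consumer might look like (the class name and payload are illustrative, not from the original question; application/json is used here since the body is JSON):

import json
from channels.generic.http import AsyncHttpConsumer

class DialogflowConsumer(AsyncHttpConsumer):
    async def handle(self, body):
        # Build the fulfillment dict, then serialize and encode it to bytes,
        # since send_response() expects a bytes body
        payload = {'fulfillmentText': "server works correctly"}
        data = json.dumps(payload).encode('utf-8')
        await self.send_response(
            200, data,
            headers=[(b"Content-Type", b"application/json")],
        )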

How to download from an HTML link (href) in Python?

How can I download a movie from a link (one that normally starts with "click")? This is the HTML code for the download file in the web page. I am looking to do this in Python, as a client that downloads the movie multiple times without saving it (just simulating traffic on the web page).
In case you have the URL:
import requests

url = "http://....."
response = requests.get(url)
You can print the response or parse it:
response.headers is a dict of the response headers.
response.content is the content of the response.
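Since the goal is to fetch the file several times without saving it, a small loop like this might work (the URL and repeat count are placeholders; stream=True avoids holding the whole file in memory):

import requests

url = "http://example.com/movie.mp4"  # placeholder download link

for _ in range(5):
    with requests.get(url, stream=True) as response:
        for chunk in response.iter_content(chunk_size=8192):
            pass  # discard each chunk; we only want to generate traffic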

Expect status [201] from Google Storage. But got status 302

I was trying to upload a file to Google Cloud Storage from a Django application. I wrote this create_gcs_file function, which gets called in my view:
import cloudstorage as gcs
from google.appengine.ext import blobstore

def create_gcs_file(filename, data):
    with gcs.open(filename, 'w') as f:
        f.write(data)
    blobstore_filename = '/gs' + filename
    return blobstore.create_gs_key(blobstore_filename)
I call this function in a view, passing it a filename and some file.read() data as parameters. Here is the view code I have written that makes use of this.
But when I upload the file I get this error:
Expect status [201] from Google Storage. But got status 302
The debug page shows that the error occurs at this line:
with gcs.open(filename, 'w') as f:
P.S.: I get this error when running my app locally with the Google App Engine SDK 1.8.5.
I also ran into this same situation; in my case the real problem was in the model field.
The field was originally:
photo = models.ImageField(upload_to='../../static/cars')
and later I changed the upload_to option to:
photo = models.ImageField(upload_to='static/cars')
That solved my problem, and it is working fine now.
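For what it's worth, the likely reason this helps is that upload_to is joined onto MEDIA_ROOT, so a '../'-style path escapes the media directory and can confuse the storage backend. A minimal model sketch (the model name is illustrative):

from django.db import models

class Car(models.Model):  # hypothetical model for illustration
    # upload_to is appended to MEDIA_ROOT; keep it a simple relative path
    photo = models.ImageField(upload_to='static/cars')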

Python requests module - Getting response cookies

I am using Python 3.3 and the requests module, and I am trying to understand how to retrieve cookies from a response. The requests documentation says:
url = 'http://example.com/some/cookie/setting/url'
r = requests.get(url)
r.cookies['example_cookie_name']
That doesn't make sense: how do you get data from a cookie if you don't already know the name of the cookie? Maybe I don't understand how cookies work? If I try to print the response cookies I get:
<<class 'requests.cookies.RequestsCookieJar'>[]>
Thanks
You can retrieve them iteratively:
import requests

r = requests.get('http://example.com/some/cookie/setting/url')
for c in r.cookies:
    print(c.name, c.value)
I got the following code from HERE:
from urllib2 import Request, build_opener, HTTPCookieProcessor, HTTPHandler
import cookielib

# Create a CookieJar object to hold the cookies
cj = cookielib.CookieJar()
# Create an opener to open pages using the http protocol and to process cookies
opener = build_opener(HTTPCookieProcessor(cj), HTTPHandler())
# Create a request object to be used to get the page
req = Request("http://www.about.com")
f = opener.open(req)
# See the first few lines of the page
html = f.read()
print html[:50]
# Check out the cookies
print "the cookies are: "
for cookie in cj:
    print cookie
See if this works for you.
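Note that the snippet above is Python 2 (urllib2 and print statements). Since the question mentions Python 3.3, the same approach with the renamed standard-library modules would look like this:

from urllib.request import Request, build_opener, HTTPCookieProcessor, HTTPHandler
from http.cookiejar import CookieJar

# Same idea: a cookie jar plus an opener that fills it as pages are fetched
cj = CookieJar()
opener = build_opener(HTTPCookieProcessor(cj), HTTPHandler())
req = Request("http://www.about.com")
f = opener.open(req)
html = f.read()
print(html[:50])
print("the cookies are:")
for cookie in cj:
    print(cookie)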
Cookies are stored in the response headers as well. If this isn't working for you, check your headers for:
"Set-Cookie: Name=Value; [Expires=Date; Max-Age=Value; Path=Value]"

Extract "Liked" songs from Pandora using python

I am attempting to use Python's urllib2 to extract info on my "liked" tracks in Pandora. I'm getting discrepancies when comparing the HTML yielded by the following code with the HTML seen via Chrome's Inspect Element:
import urllib2

headers = {'User-Agent': 'Mozilla/5.0'}
url = 'http://www.pandora.com/profile/likes/myusername'
request = urllib2.Request(url, None, headers)
response = urllib2.urlopen(request)
html = response.read()
I'm thinking this might be due to the lack of authentication even though I'm still able to load the same page logged out using Chrome's incognito mode.
So I added the following lines to attempt to use basic authentication on my request:
SERVER = 'pandora.com'
authinfo = urllib2.HTTPPasswordMgrWithDefaultRealm()
authinfo.add_password(None, SERVER, "login", "password")
handler = urllib2.HTTPBasicAuthHandler(authinfo)
myopener = urllib2.build_opener(handler)
opened = urllib2.install_opener(myopener)
headers = {'User-Agent': 'Mozilla/5.0'}
url = 'http://www.pandora.com/profile/likes/chris.r.armstrong'
request = urllib2.Request(url, None, headers)
response = urllib2.urlopen(request)
html = response.read()
Still not getting the right HTML response back. Any suggestions?
The DOM (HTML page) you see inside the browser is not the payload of the HTTP request. Once an HTTP request has been made by a browser, a number of transformations happen, depending on how complex the page is. At the basic level, the parser might reorder and/or reorganize the content as mandated by the HTML5 parsing algorithm. Then JS scripts and XMLHttpRequests will modify and add content to the DOM.
If you really need the DOM as seen in the browser, you might want to use a webdriver, so that you get back what the browser sees and not only what the HTTP client sees.
Hope it helps.
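A minimal sketch of the webdriver approach using Selenium (assuming a local ChromeDriver install; this is not part of the original answer):

from selenium import webdriver

driver = webdriver.Chrome()  # assumes chromedriver is on your PATH
driver.get('http://www.pandora.com/profile/likes/myusername')
# page_source returns the DOM after JavaScript has run,
# which is what Chrome's inspector shows
html = driver.page_source
driver.quit()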