Generate start_urls list in Scrapy with constraints - python-2.7

I need to parse URLs like the one below with Scrapy (ads from a real estate agent):
http://ws.seloger.com/search.xml?idq=?&cp=72&idqfix=1&pxmin=30000&pxmax=60000&idtt=2&SEARCHpg=1&getDtCreationMax=1&tri=d_dt_crea
The response from the server is limited to 200 results, whatever min/max price you put in the URL (see pxmin / pxmax in the URL).
Therefore, I would like to use a function that generates the URLs for start_urls with price bands chosen so that no URL returns more than 200 search results, while together the URLs cover a price range of, say, [0:1000000].
The function would do the following:
Take the first URL
Check the number of results (the "nbTrouvees" tag in the XML response)
Adjust the price band if the count is > 200, or add the URL to the start_urls list if it is < 200
Increment the price band until it reaches 1,000,000
Return the final start_urls list, which will cover all properties for a given region
This obviously means numerous requests to the server just to find the right price bands, plus all the requests generated by the spider for the final scraping.
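In rough Python, the generation step would look something like the sketch below (untested; the helper names are just placeholders, and it assumes the result count can be read straight from the "nbTrouvees" tag and that only pxmin/pxmax change between requests):
# Rough sketch (untested): build start_urls by walking price bands so that
# each band returns fewer than 200 results. Helper names are placeholders.
import urllib2
import re

URL_TEMPLATE = ("http://ws.seloger.com/search.xml?idq=?&cp=72&idqfix=1"
                "&pxmin=%d&pxmax=%d&idtt=2&SEARCHpg=1&getDtCreationMax=1&tri=d_dt_crea")

def count_results(url):
    body = urllib2.urlopen(url).read()
    match = re.search(r"<nbTrouvees>(\d+)</nbTrouvees>", body)
    return int(match.group(1)) if match else 0

def build_start_urls(price_max=1000000, initial_step=50000):
    start_urls = []
    pxmin = 0
    step = initial_step
    while pxmin < price_max:
        pxmax = min(pxmin + step, price_max)
        url = URL_TEMPLATE % (pxmin, pxmax)
        if count_results(url) > 200 and step > 1000:
            step //= 2          # band too wide: narrow it and try again
        else:
            start_urls.append(url)   # accept the band (even >200 at minimum step)
            pxmin = pxmax
            step = initial_step
    return start_urls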
1) My first question therefore is: is there a better way to tackle this, in your view?
2) My second question: I have tried to retrieve the content of one of these pages with Scrapy, just to see how I could parse the "nbTrouvees" tag without using a spider, but I am stuck.
I tried using TextResponse but got nothing in return. I then tried the code below, but it fails because the method body_as_unicode doesn't exist on the Response object.
>>>link = 'http://ws.seloger.com/search.xml?idq=1244,1290,1247&ci=830137&idqfix=1&pxmin=30000&pxmax=60000&idtt=2&SEARCHpg=1&getDtCreationMax=1&tri=d_dt_crea'
>>>xxs = XmlXPathSelector(Response(link))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/Gilles/workspace/Immo-Lab/lib/python2.7/site- packages/scrapy/selector/lxmlsel.py", line 31, in __init__
_root = LxmlDocument(response, self._parser)
File "/Users/Gilles/workspace/Immo-Lab/lib/python2.7/site- packages/scrapy/selector/lxmldocument.py", line 27, in __new__
cache[parser] = _factory(response, parser)
File "/Users/Gilles/workspace/Immo-Lab/lib/python2.7/site- packages/scrapy/selector/lxmldocument.py", line 13, in _factory
body = response.body_as_unicode().strip().encode('utf8') or '<html/>'
AttributeError: 'Response' object has no attribute 'body_as_unicode'
Any idea? (FYI, it works with my spider.)
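For reference, this rough, untested sketch is the kind of thing I am trying to get working outside a spider: it fetches the XML with urllib2 and wraps it in an XmlResponse, so the selector has a real body to parse instead of a bare Response built from just a URL:
# Sketch only: fetch the XML by hand, wrap it in an XmlResponse (which has a
# body and therefore body_as_unicode), then query <nbTrouvees> with XPath.
import urllib2
from scrapy.http import XmlResponse
from scrapy.selector import XmlXPathSelector

link = ('http://ws.seloger.com/search.xml?idq=1244,1290,1247&ci=830137&idqfix=1'
        '&pxmin=30000&pxmax=60000&idtt=2&SEARCHpg=1&getDtCreationMax=1&tri=d_dt_crea')
body = urllib2.urlopen(link).read()
response = XmlResponse(url=link, body=body)
xxs = XmlXPathSelector(response)
# Adjust the XPath if the feed uses XML namespaces.
print xxs.select('//nbTrouvees/text()').extract()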
Thank you
Gilles

Related

Retrieving data saved under an HDF5 group as a CArray

I am new to the HDF5 file format, and I have data (images) saved in HDF5 format. The images are saved under a group called 'data', which sits under the root group, as CArrays. What I want to do is retrieve a slice of the saved images, for example the first 400 or something like that. The following is what I did:
h5f = h5py.File('images.h5f', 'r')
image_grp= h5f['/data/'] #the image group (data) is opened
print(image_grp[0:400])
but I am getting the following error
Traceback (most recent call last):
File "fgf.py", line 32, in <module>
print(image_grp[0:40])
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
(/feedstock_root/build_artefacts/h5py_1496410723014/work/h5py-2.7.0/h5py/_objects.c:2846)
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
(/feedstock_root/build_artefacts/h5py_1496410723014/work/h5py
2.7.0/h5py/_objects.c:2804)
File "/..../python2.7/site-packages/h5py/_hl/group.py", line 169, in
__getitem__oid = h5o.open(self.id, self._e(name), lapl=self._lapl)
File "/..../python2.7/site-packages/h5py/_hl/base.py", line 133, in _e name = name.encode('ascii')
AttributeError: 'slice' object has no attribute 'encode'
I am not sure why I am getting this error, and I am not even sure whether I can slice images that are saved as individual datasets.
I know this is an old question, but it is the first hit when searching for 'slice' object has no attribute 'encode' and it has no solution.
The error happens because image_grp is a group, not a dataset: indexing a group expects a (string) key name, which is why the slice object ends up being passed to encode(). You are looking for the dataset element inside the group.
You need to find/know the key for the item that contains the dataset.
One suggestion is to list all keys in the group, and then guess which one it is:
print(list(image_grp.keys()))
This will give you the keys in the group.
A common case is that the first element is the image, so you can do this:
image_grp= h5f['/data/']
image = image_grp[list(image_grp.keys())[0]]
print(image[0:400])
Yesterday I had a similar error and wrote this little piece of code to take my desired slice of an h5py file.
import h5py

def h5py_slice(h5py_file, begin_index, end_index):
    slice_list = []
    with h5py.File(h5py_file, 'r') as f:
        for i in range(begin_index, end_index):
            slice_list.append(f[str(i)][...])
    return slice_list
and it can be used like
the_desired_slice_list = h5py_slice('images.h5f', 0, 400)

How to fix this python code that performs login to website

I am a novice in Python. I extracted the code below, which logs in to a website, from an online post, but I am getting an error.
Please help me fix it; an explanation would also help me.
import requests
with requests.Session() as c:
    EMAIL = 'noob.python#gmail.com'
    PASSWORD = 'Dabc#123'
    URL = 'https://www.linkedin.com/'
    c.get(URL)
    token = c.cookies['CsrfParam']
    # This is the form data that the page sends when logging in
    login_data = {'loginCsrfParam': token, 'session_key': EMAIL, 'session_password': PASSWORD}
    # Authenticate
    r = c.post(URL, data=login_data)
    # Try accessing a page that requires you to be logged in
    r = c.get('https://www.linkedin.com/feed/')
    print r.content
I am stuck with the error below:
C:\Python27>python website.py
Traceback (most recent call last):
File "website.py", line 8, in <module>
token = c.cookies['CsrfParam']
File "C:\Python27\lib\site-packages\requests\cookies.py", line 329, in __getitem__
return self._find_no_duplicates(name)
File "C:\Python27\lib\site-packages\requests\cookies.py", line 400, in _find_no_duplicates
raise KeyError('name=%r, domain=%r, path=%r' % (name, domain, path))
KeyError: "name='CsrfParam', domain=None, path=None"
The reason you're getting the error is that you're calling a value from a list which is empty. To call the first item in the list you say list[0]; in this case the list you're calling is empty, so the first value doesn't exist, hence the error.
I've run your code and there is no #id value of 'recaptcha-token', which is why the code is returning an empty list. The only place a recaptcha token is needed is for signing up, so I would suggest trying to log in without creating the authenticity_token.
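If it helps to see what is actually available before indexing into the cookie jar, something like this (illustrative only) will show whether a 'CsrfParam' cookie was set at all:
# Illustrative only: list the cookies the session received, and use .get()
# so a missing cookie returns None instead of raising KeyError.
import requests

with requests.Session() as c:
    c.get('https://www.linkedin.com/')
    print c.cookies.get_dict()          # all cookies set for this session
    print c.cookies.get('CsrfParam')    # None if the cookie is not present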

tweepy.error.TweepError: [{u'message': u'Text parameter is missing.', u'code': 38}]

I am using the Tweepy Twitter API for Python, and while using it I got an error.
I am not able to use the send_direct_message(user/screen_name/user_id, text) method.
Here is my code:
import tweepy
consumer_key='XXXXXXXXXXXXXXXXX'
consumer_secret='XXXXXXXXXXXXXXXXX'
access_token='XXXXXXXXXXXXXXXXX'
access_token_secret='XXXXXXXXXXXXXXXXX'
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
API = tweepy.API(auth)
user = API.get_user('SSPendse')
screen_name="CJ0495"
id = 773436067231956992
text="Message Which we have to send must have maximum 140 characters, If it is Greater then the message will be truncated upto 140 characters...."
# re = API.send_direct_message(screen_name, text)
re = API.send_direct_message(id, text)
print re
I got the following error:
Traceback (most recent call last):
File "tweetApi.py", line 36, in <module>
re = API.send_direct_message(id, text)
File "/usr/local/lib/python2.7/dist-packages/tweepy/binder.py", line 245, in _call
return method.execute()
File "/usr/local/lib/python2.7/dist-packages/tweepy/binder.py", line 229, in execute
raise TweepError(error_msg, resp, api_code=api_error_code)
tweepy.error.TweepError: [{u'message': u'Text parameter is missing.', u'code': 38}]
What mistake have I made?
I also have another problem related to Tweepy: how can I move to page two, or get more followers, with the following code?
i = 1
user = API.get_user('Apple')
followers = API.followers(user.id, -1)
for follower in followers:
    print follower, '\t', i
    i = i + 1
If I run the code I get only 5,000 followers; however, user.followers_count gives 362,705 followers (this number may have changed by the time you check it). How can I see the remaining followers?
Thank You... :)
To solve your first error, replace re = API.send_direct_message(id, text) with re = API.send_direct_message(id, text=text). This function only works if you give it the message as a named parameter. The parameter name you need here is "text", so you might want to change your variable name to avoid confusion. Also, I just tried it: since you are sending a direct message, not a tweet, it will not be truncated to the first 140 characters.
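For example, using the variables from your question (with the message renamed so it doesn't clash with the parameter name):
# Pass the message body via the named "text" parameter.
message = ("Message which we have to send; since this is a direct message, "
           "not a tweet, it is not cut off at 140 characters.")
result = API.send_direct_message(id, text=message)
print result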
For your second question, this should do the trick, as explained here:
import time

followers = []
for page in tweepy.Cursor(API.followers, screen_name='Apple').pages():
    followers.extend(page)
    time.sleep(60)  # slow down the requests to avoid hitting the rate limit
print(followers)

praw.errors.Forbidden: HTTP error when using Reddit get_flair_list

I am trying to get the comments for each Reddit post.
This is the way I am using to get flair list:
import praw
import webbrowser

r = praw.Reddit('OAuth testing example by u/_Daimon_ ver 0.1 see '
                'https://praw.readthedocs.org/en/latest/'
                'pages/oauth.html for source')
r.set_oauth_app_info(client_id='[client id]',
                     client_secret='[client secret]',
                     redirect_uri='http://localhost/authorize_callback')
url = r.get_authorize_url('uniqueKey', 'modflair', True)
webbrowser.open(url)
Then I got the code from the returned url, and I put the code in the access information, like this:
access_information = r.get_access_information('[returned code]')
Then when I try to call get_flair_list(), just like in the PRAW tutorial, like this:
item = next(r.get_subreddit('travel').get_flair_list())
It gives me an error, showing:
Traceback (most recent call last):
File "", line 1, in
File "/Library/Python/2.7/site-packages/praw-3.4.0-py2.7.egg/praw/init.py", line 565, in get_content
page_data = self.request_json(url, params=params)
File "", line 2, in request_json
File "/Library/Python/2.7/site-packages/praw-3.4.0-py2.7.egg/praw/decorators.py", line 116, in raise_api_exceptions
return_value = function(*args, **kwargs)
File "/Library/Python/2.7/site-packages/praw-3.4.0-py2.7.egg/praw/init.py", line 620, in request_json
retry_on_error=retry_on_error)
File "/Library/Python/2.7/site-packages/praw-3.4.0-py2.7.egg/praw/init.py", line 452, in _request
_raise_response_exceptions(response)
File "/Library/Python/2.7/site-packages/praw-3.4.0-py2.7.egg/praw/internal.py", line 208, in _raise_response_exceptions
raise Forbidden(_raw=response)
praw.errors.Forbidden: HTTP error
Here's the link of that PRAW tutorial: PRAW tutorial
Do you know how to solve this problem? How can I call get_flair_list() to get all the comments of a Reddit post?
There are a few things potentially going on here.
The first issue (and the most likely) is that you are logging in wrong:
r = praw.Reddit('OAuth testing example by u/_Daimon_ ver 0.1 see '
                'https://praw.readthedocs.org/en/latest/'
                'pages/oauth.html for source')
DONT DO THIS, EVER
Even though this syntax is valid (adjacent string literals are simply concatenated), it makes your code INCREDIBLY hard to read. The most readable way is to have r = praw.Reddit('OAuth-testing') (the OAuth-testing bit can be whatever you want, as long as it is the same as in your praw.ini file), then set up your praw.ini file as such:
[DEFAULT]
# A boolean to indicate whether or not to check for package updates.
check_for_updates=True
# Object to kind mappings
comment_kind=t1
message_kind=t4
redditor_kind=t2
submission_kind=t3
subreddit_kind=t5
# The URL prefix for OAuth-related requests.
oauth_url=https://oauth.reddit.com
# The URL prefix for regular requests.
reddit_url=https://www.reddit.com
# The URL prefix for short URLs.
short_url=https://redd.it
[OAuth-testing]
user_agent=USER-AGENT-HERE
username=REDDIT-ACCOUNT-USERNAME
password=REDDIT-ACCOUNT-PASSWORD
client_id=REDDIT-APP-CLIENT-ID
client_secret=REDDIT-APP-CLIENT-SECRET
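Putting that together, the top of the script then shrinks to something like this (a sketch following the convention above; exactly how the section name is picked up can vary between PRAW versions, so treat it as a sketch rather than a guaranteed recipe):
import praw

# 'OAuth-testing' matches the [OAuth-testing] section of praw.ini above.
r = praw.Reddit('OAuth-testing')

# get_flair_list still needs moderator access on the subreddit (see the note below).
item = next(r.get_subreddit('travel').get_flair_list())
print(item)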
Just as an additional note, get_flair_list() also requires moderator access, as documented here
Also, you ask at the bottom:
How can I call get_flair_list() to get all the comments of a Reddit post?
This is not how you get all the comments of a post; if that is what you want to do, you can read this tutorial in the PRAW docs.
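If fetching all the comments of a single post really is the goal, a minimal sketch with the PRAW 3.x API (the version shown in your traceback) looks roughly like this; it assumes r is the authenticated praw.Reddit session from your code above, and the submission id is just a placeholder:
# Minimal sketch (PRAW 3.x): fetch one submission and walk its comment tree.
from praw import helpers

submission = r.get_submission(submission_id='5or86n')  # placeholder id
submission.replace_more_comments(limit=None)            # expand "load more comments" stubs
for comment in helpers.flatten_tree(submission.comments):
    print(comment.body)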
If you have any further questions don't hesitate to comment on this answer and I or somebody else can answer it!

Django GAE cron giving DeadlineExceededError

I'm running Django 1.5 on GAE. I have a cron job that goes over several thousand URLs, grabs their "likes" count, and saves it into the DB. It can easily take more than 10 minutes to complete. It works when I run it locally as a normal Linux cron job but fails with this error on GAE:
Traceback (most recent call last):
File "/base/data/home/runtimes/python27/python27_lib/versions/1/google/appengine/runtime/wsgi.py", line 266, in Handle
result = handler(dict(self._environ), self._StartResponse)
File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/django-1.5/django/core/handlers/wsgi.py", line 255, in __call__
response = self.get_response(request)
File "/base/data/home/runtimes/python27/python27_lib/versions/third_party/django-1.5/django/core/handlers/base.py", line 175, in get_response
signals.got_request_exception.send(sender=self.__class__, request=request)
DeadlineExceededError
My setup:
app.yaml:
- url: /tasks/*
  script: myproject.wsgi.application
  login: admin
cron.yaml:
- description: update_facebook_resource
  url: /tasks/update_facebook_resource
  schedule: every day 04:05
  timezone: Europe/Berlin
views.py
def update_facebook_resource(request):
    resources = Resource.objects.filter(inactive=0).order_by('id')
    url_start = "https://graph.facebook.com/fql?q=select++total_count+from+link_stat+where+url%3D"
    url_end = "&access_token=..."
    for item in resources:
        url = item.link
        url_final = url_start + "%22" + url + "%22" + url_end
        data = json.load(urllib2.urlopen(url_final))
        likes = data["data"][0]["total_count"]
        query = Resource.objects.get(id=item.id)
        query.facebook_likes = likes
        query.save(update_fields=['facebook_likes'])
    return http.HttpResponse('ok')
What and how should I change so that GAE lets me complete it? I've read https://developers.google.com/appengine/articles/deadlineexceedederrors but it doesn't really give me what I need.
thanks
It's not a question of just getting GAE to let you complete the function. When developing for App Engine, you do need to think in a slightly different way, precisely because of things like the request deadline. In your case, you need to break the task up into chunks, and process each of those chunks individually.
You don't say if you're using django-nonrel with the GAE datastore, or if you're using Cloud SQL and therefore the standard Django API. If the former, you can use query cursors to keep track of your progress through the Resources. After each chunk, you can use deferred tasks to trigger the next chunk, passing it the cursor so it picks up where the last one left off.
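A rough sketch of the deferred-task variant (illustrative only: it sidesteps datastore cursors by passing the last processed id between chunks, reuses the Resource model and FQL call from the question, and assumes the deferred builtin is enabled in app.yaml):
# Illustrative sketch: process Resources in small chunks and re-enqueue a
# deferred task for the next chunk, so no single request hits the deadline.
import json
import urllib2
from google.appengine.ext import deferred

CHUNK_SIZE = 50

def update_chunk(last_id=0):
    chunk = (Resource.objects.filter(inactive=0, id__gt=last_id)
             .order_by('id')[:CHUNK_SIZE])
    processed = 0
    for item in chunk:
        url = ("https://graph.facebook.com/fql?q=select++total_count+from+link_stat"
               "+where+url%3D%22" + item.link + "%22&access_token=...")
        data = json.load(urllib2.urlopen(url))
        item.facebook_likes = data["data"][0]["total_count"]
        item.save(update_fields=['facebook_likes'])
        last_id = item.id
        processed += 1
    if processed == CHUNK_SIZE:
        # There may be more rows: queue the next chunk instead of looping here.
        deferred.defer(update_chunk, last_id)

def update_facebook_resource(request):
    deferred.defer(update_chunk)   # kick off the first chunk and return immediately
    return http.HttpResponse('ok')
Each deferred invocation runs as its own task-queue request, which gets a longer deadline than a normal frontend request, so as long as a single chunk of 50 URLs fits comfortably inside that window the whole job completes without tripping DeadlineExceededError.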