The scraper I deployed on Scrapy Cloud produces different results from the local version.
Locally I can extract every field of a product item (from an online retailer), but on Scrapy Cloud the fields "ingredients" and "list of prices" always come back empty.
The attached picture shows the two fields that are always empty on Scrapy Cloud, whereas they work perfectly locally.
I'm using Python 3, and the stack was configured as scrapy:1.3-py3.
At first I thought it was an issue with the regex and Unicode, but it seems not.
So I tried everything: ur, ur RE.ENCODE ... and it didn't work.
For the ingredients part, my code is the following :
data_box=response.xpath('//*[@id="ingredients"]').css('div.information__tab__content *::text').extract()
data_inter=''.join(data_box).strip()
match1=re.search(r'([Ii]ngr[ée]dients\s*\:{0,1})\s*(.*)\.*',data_inter)
match2=re.search(r'([Cc]omposition\s*\:{0,1})\s*(.*)\.*',data_inter)
if match1:
    result_matching_ingredients=match1.group(1,2)[1].replace('"','').replace(".","").replace(";",",").strip()
elif match2 :
    result_matching_ingredients=match2.group(1,2)[1].replace('"','').replace(".","").replace(";",",").strip()
else:
    result_matching_ingredients=''
ingredients=result_matching_ingredients
It seems that the matching never occurs on scrapy cloud.
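To check the matching outside the spider, the logic above can be isolated into a plain function and run against text saved from both environments. This is only a sketch: `extract_ingredients` and `PATTERNS` are hypothetical names, and the patterns are simplified versions of the ones in the question (no capture group for the label).

```python
import re

# Tried in order, like match1 / match2 in the question.
PATTERNS = (
    re.compile(r"[Ii]ngr[ée]dients\s*:?\s*(.*)"),
    re.compile(r"[Cc]omposition\s*:?\s*(.*)"),
)


def extract_ingredients(text):
    """Return the cleaned text that follows the label, or '' if no label matches."""
    for pattern in PATTERNS:
        match = pattern.search(text.strip())
        if match:
            cleaned = match.group(1).replace('"', "").replace(".", "").replace(";", ",")
            return cleaned.strip()
    return ""
```

Feeding it the joined text from a local run and from a cloud job should show whether the regex or the input differs.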
For prices, my code is the following :
list_prices=[]
for package in list_packaging :
    tonnage=package.css('div.product__varianttitle::text').extract_first().strip()
    prix_inter=(''.join(package.css('span.product__smallprice__text').re(r'\(\s*\d+\,\d*\s*€\s*\/\s*kg\)')))
    prix=prix_inter.replace("(","").replace(")","").replace("/","").replace("€","").replace("kg","").replace(",",".").strip()
    list_prices.append(prix)
That's the same story. Still empty.
I repeat : it's working fine on my local version.
These two fields are the only ones causing trouble: I'm extracting a bunch of other data (with regex too) on Scrapy Cloud and I'm very satisfied with it.
Any ideas, guys?
I work with ScrapingHub very often, and this is how I usually debug:
Check the job requests (through the ScrapingHub interface)
This shows whether a redirection is making the page slightly different, for example an added query string such as ?lang=en.
Check the job logs (through the ScrapingHub interface)
You can either print or use a logger to check anything you want throughout your parser. So if you really want to be sure the scraper sees the same thing locally and on ScrapingHub, you can print(response.body) in both places and compare the output to find what causes the difference.
If you can't find it, I'll try to deploy a little spider on ScrapingHub and edit this post if I have some time left today!
Check that ScrapingHub's logs show the expected version of Python, even if the stack is correctly set up in the project's yml file.
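One way to make these checks concrete is to log a small fingerprint of the runtime and of the response, then compare the local value with the one in the job log. A minimal sketch: `runtime_fingerprint` is a hypothetical helper, not part of Scrapy, and the marker word is only an example.

```python
import hashlib
import sys


def runtime_fingerprint(body, marker="Ingrédients", encoding="utf-8"):
    """Summarise the interpreter and a response body for side-by-side comparison.

    `body` is the raw page as bytes (response.body inside a Scrapy callback).
    """
    text = body.decode(encoding, errors="replace")
    return {
        "python": "%d.%d" % sys.version_info[:2],     # interpreter actually running
        "body_length": len(body),                     # changes if the page changes
        "body_sha1": hashlib.sha1(body).hexdigest(),  # quick equality check
        "marker_found": marker in text,               # is the expected text there at all?
    }
```

Inside a spider callback this could be logged with self.logger.info(runtime_fingerprint(response.body)). On pages with dynamic content the hash may differ between runs, so the length and the marker flag are the more useful fields there.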
I have a simple program (not related to school) that requires a lot of random words in the local database. Earlier today, I found this website http://www.setgetgo.com/randomword/get.php that will always generate a random word every time the page is reloaded. I have an idea to create a variable that will consistently grab the value from this website, and append it to my list (acts as a local database).
Any idea how to do that? I thought there is a "wget" library in python too. However, my python keeps returning an error.
My idea:
a_variable = wget the website text
Here is the block of code you need
import requests
res = requests.get("http://www.setgetgo.com/randomword/get.php")
print(res.text)
I would advise you to dive into Requests and BeautifulSoup if you want to learn more about it.
Good luck
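To append a fresh word to a list on every fetch, as the question asks, the download can be wrapped in a small loop. This sketch swaps Requests for the standard library's urllib.request so that it has no dependencies; `fetch_word` and `collect_words` are hypothetical names, and it assumes the page returns the word as plain text.

```python
import urllib.request

WORD_URL = "http://www.setgetgo.com/randomword/get.php"  # URL from the question


def fetch_word(url=WORD_URL, timeout=10):
    """Download the page and return its text without surrounding whitespace."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8").strip()


def collect_words(count, fetch=fetch_word):
    """Call `fetch` `count` times and return the results as a list."""
    words = []
    for _ in range(count):
        words.append(fetch())
    return words
```

Calling collect_words(20) would perform 20 requests; passing a different `fetch` callable makes the loop easy to try without network access.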
Can anyone point me in the right direction as to where I should place a script used solely for loading data into ndb? I want to upload all the data into the GAE ndb so that the application can query it.
Right now, the data loading is in my application. I wish to place it separately from the main application.
Should it be edited in the yaml file?
EDITED
This is a snippet of the entity and the handler to upload the data into GAE ndb.
I wish to place this chunk of code separately from my main application .py, because this data won't be uploaded frequently and I want to keep the code in the main application "cleaner".
class TagTrend_refine(ndb.Model):
    tag = ndb.StringProperty()
    trendData = ndb.BlobProperty(compressed=True)

class MigrateData(webapp2.RequestHandler):
    def get(self):
        listOfEntities = []
        f = open("tagTrend_refine.txt")
        lines = f.readlines()
        f.close()
        for line in lines:
            temp = line.strip().split("\t")
            data = TagTrend_refine(
                tag = temp[0],
                trendData = temp[1]
            )
            listOfEntities.append(data)
        ndb.put_multi(listOfEntities)
For example if I placed the above code in a file called dataLoader.py, where should I call this script to invoke?
In app.yaml alongside my main application(knowledgeGraph.application)?
- url: /.*
  script: knowledgeGraph.application
You don't show us the application object (no doubt a WSGI app) in your knowledgeGraph.py module, so I can't know what URL you want to serve with the MigrateData handler -- I'll just guess it's something like /migratedata.
So the class TagTrend_refine should be in a separate file (usually called models.py) so that both your dataloader.py, and your knowledgeGraph.py, can import models to access it (and models.py will need its own import of ndb of course). (Then of course access to the entity class will be as models.TagTrend_refine -- very basic Python).
Next, you'll complete dataloader.py by defining a WSGI app, e.g, at end of file,
app = webapp2.WSGIApplication(routes=[('/migratedata', MigrateData)])
(of course this means this module will need to import webapp2 as well -- can I take for granted a knowledge of super-elementary Python?).
In app.yaml, as the first URL, before that /.*, you'll have:
- url: /migratedata
  script: dataloader.app
Given all this, when you visit '/migratedata', your handler will read the "tagTrend_refine.txt" file that you uploaded together with your .py, .yaml, and so on, files in your overall GAE application, and unconditionally create one entity per line of that file (assuming you fix the multiple indentation problems in your code as displayed above, but, again, this is just super-elementary Python -- presumably you've used both tabs and spaces and they show up OK in your editor, but not here on SO... I recommend you use strictly, only spaces, never tabs, in Python code).
However this does seem to be a peculiar task. If /migratedata gets visited twice, it will create duplicates of all entities. If you change the tagTrend_refine.txt and deploy a changed variation, then visit /migratedata... all old entities will stick around and all the new entities will join them. And so forth.
Moreover -- /migratedata is NOT idempotent (if visited more than once it does not produce the same state as running it just once) so it shouldn't be a GET (and now we're on to super-elementary HTTP for a change!-) -- it should be a POST.
In fact I suspect (but I'm really flying blind here, since you see fit to give such tiny amounts of information) that you in fact want to upload a .txt file to a POST handler and do the updates that way (perhaps avoiding duplicates...?). However, I'm no mind reader, so this is about as far as I can go.
I believe I have fully answered the question you posted (though perhaps not the one you meant but didn't express:-) and by SO's etiquette it would be nice to upvote and accept this answer, then, if needed, post another question, expressing MUCH more clearly and completely what you're trying to achieve, your current .py and .yaml (ideally with correct indentation), what they actually do and why you'd like to do something different. For POST vs GET in particular, just study When should I use GET or POST method? What's the difference between them? ...
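The duplicate-entity problem raised in this answer could be avoided by giving each entity a deterministic key. A sketch, assuming each tag is meant to be unique (the question does not say so): parse the file into a dict keyed by tag, then in the handler build each entity with an explicit id, e.g. TagTrend_refine(id=tag, tag=tag, trendData=data), so that a second run overwrites existing entities instead of adding new ones. `parse_trend_lines` is a hypothetical helper.

```python
def parse_trend_lines(lines):
    """Map each tag to its trend data from tab-separated lines.

    Later lines win when a tag is repeated; blank or malformed lines are skipped.
    """
    rows = {}
    for line in lines:
        parts = line.strip().split("\t")
        if len(parts) < 2 or not parts[0]:
            continue  # no tab-separated payload on this line
        rows[parts[0]] = parts[1]
    return rows
```

Because the result is keyed by tag, running the load twice on the same file yields the same set of keys.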
Alex's solution will work, as long as all your data can be loaded in under 1 minute, as that's the timeout for an App Engine request.
For larger data, consider calling the datastore API directly from your own computer where you have the source. It's a bit of a hassle because it's a different API; it's not ndb. But it's still a pretty simple API. Here's some code that calls the API:
https://github.com/GoogleCloudPlatform/getting-started-python/blob/master/2-structured-data/bookshelf/model_datastore.py
Again, this code can run anywhere. It doesn't need to be uploaded to app engine to run.
I just set up the environment for an existing Django project, on a new Mac. I know for certain there is nothing wrong with the code itself (just cloned the repo), but for some reason, Django can't seem to retrieve data from the database.
I know the correct tables and data are in the db.
I know the codebase is as it should be.
I can make queries using the Django shell.
Django doesn't throw any errors despite the data missing on the web page.
I realize that it's hard to debug this without further information, but I would really appreciate a finger pointing me to the right direction. I can't seem to find any useful logs.
EDIT:
I just realized the problem lies elsewhere. Unfortunately I can't delete this post with the bounty still open.
Without seeing any code, I can only suggest some general advice that might help you debug your problem. Please add a link to your repository if you can or some snippets of your database settings, the view which includes the database queries etc...
Debugging the view
The first thing I would recommend is using the Python debugger inside the view which queries the database. If you've not used pdb before, it's a life saver which allows you to set breakpoints in your Python script and then interactively execute code inside the interpreter.
import pdb
pdb.set_trace()
# look at the results of your queries
If you are using the Django ORM, the QuerySet returned from the query should have all the data you expect.
If it doesn't then you need to look into your database configuration in settings.py.
If it does, then you might not be returning that object to the template? Unlikely, as you said the code was the same, but double check the objects you pass with your HttpResponse object.
Debugging the database settings
If you can query the database using the project settings inside settings.py from the Django shell, it sounds unlikely that there is a problem with this - but like everything, double check.
You said that you've set up the project on a new Mac. Was it on a different operating system before? Maybe there is a problem with the paths now - to make your project platform independent, remember to use os.path.join() when working with file paths.
And what about the username and password details....
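For the path point above, a small illustration of building paths with os.path.join() instead of hard-coded separators. `project_path` and the directory names are hypothetical; in a Django project the first argument would typically be the base directory defined in settings.py.

```python
import os


def project_path(base_dir, *parts):
    """Join path components using the separator of the current platform."""
    return os.path.join(base_dir, *parts)


# Example: a database file kept under <base_dir>/db/
db_file = project_path("myproject", "db", "dev.sqlite3")
```

The same call produces backslash-separated paths on Windows and slash-separated paths on macOS, which a hard-coded string would not.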
Debugging the template
Maybe your template is referencing the wrong object variable name or object attribute. You mentioned that
Django doesn't throw any errors despite the data missing on the web
page.
This doesn't really tell us much - to quote the Django docs -
If you use a variable that doesn’t exist, the template system will
insert the value of the TEMPLATE_STRING_IF_INVALID setting, which is
set to '' (the empty string) by default.
So to check all the variables available to your template, you could use the debug template tag
{% debug %}
Probably even better though is to use the django-debug-toolbar - this will also let you examine the SQL queries your view is making.
Missing Modules
I would expect this to raise an exception if this were the problem, but have you checked that you have the psycopg module on your new machine?
So I made a Python script to grab images from a subreddit (from Imgur and Imgur albums). I got that working (it returns image URLs) and wanted to integrate it into Django so I can deploy it online and let other people use it. When I run the server on my machine, the images from the first subreddit load flawlessly, but when I try another subreddit, it craps out on me (I'll post the exception at the end of the post). If I restart the Django server, the same thing happens: the first request loads without a hitch, but the second one craps out. What gives?
Exception Type: siteError, which pretty much encompasses urllib2.HTTPError, urllib2.URLError, socket.error, socket.sslerror
Since I'm a noob in all of this, I'm not sure what's going on. Would anyone care to help me?
Note: I also host the app on pythoneverywhere.com. Same result.
Using a global in your get_subreddit function looks wrong to me.
reddit_url = 'http://reddit.com/r/'
def get_subreddit(name):
    global reddit_url
    reddit_url += name
Every time you run that function, you append the value of name to the global reddit_url.
It starts as http://reddit.com/r/
run get_subreddit("python") and it changes to http://reddit.com/r/python
run get_subreddit("python") again, and it changes to http://reddit.com/r/pythonpython
at this point, the url is invalid, and you have to restart your server.
You probably want to change get_subreddit so that it returns a url, and fetch this url in your function.
def get_subreddit(name):
    return "http://reddit.com/r/" + name
# in your view
url = get_subreddit("python")
# now fetch url
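The difference between the two versions can be seen by calling each one twice. A small sketch; `get_subreddit_buggy` and `_shared_url` are hypothetical names standing in for the original global-based function.

```python
BASE_URL = "http://reddit.com/r/"

_shared_url = BASE_URL  # module-level state, as in the original function


def get_subreddit_buggy(name):
    """Reproduces the problem: every call mutates the shared module-level URL."""
    global _shared_url
    _shared_url += name
    return _shared_url


def get_subreddit(name):
    """Builds the URL from the constant base, so calls are independent."""
    return BASE_URL + name
```

Module-level state lives as long as the server process, which is why the first request after a restart works and later ones fail.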
There are probably other mistakes in your code as well. You can't really expect somebody on stack overflow to fix all the problems for you on a project of this size. The best thing you can do is learn some techniques for debugging your code yourself.
Look at the full traceback, not just the final SiteError. See what line of your code the problem is occurring in.
Add some logging or print statements, and try to work out why the SiteError is occurring.
Are you trying to download the URL that you think you are? (As I explained above, I don't think you are, because of the problems with your get_subreddit function.)
Finally, I recommend you make sure that the site works on your dev machine before you move on to deploying it on python anywhere. Deploying can cause lots of headaches all by itself, so it's good to start with an app that's working before you start.
Good luck :)
Like many others, I've been learning web development on Django by building a test app. I have the basic models set up, and I've populated a few of the tables with the absolute minimum data needed for further testing, using fixtures.
Now for a different table, I want to create data tuples through a custom management command which takes the required arguments. If this works as expected, I'll save the created data to the database by adding the --save option.
The syntax of the command is like this
create_raw_data owner_id temperature [--save]
where owner_id is required and temperature (in C) is optional. Within the handle method, I'm using factory boy to create the raw_data with the given arguments etc.
I did have some issues but searching on SO, google, django docs etc, I've got the command working fine.
EXCEPT when I input a negative temperature...
Then I get the following error
Usage: C:\test\manage.py create_raw_data [options]
Creates a RawData object. Usage: create_raw_data owner_id temperature [--save]
C:\test\manage.py: error: no such option: -5
The code I have for parsing the args is like this
for index, item in enumerate(args):
    if index == 0:
        owner_id = int(item)
    elif index == 1:
        temp = int(item)
I put a print(args) as the first line inside handle, but it seems control is not even reaching there.
I'm not sure what is wrong... please help...
Thanks a lot
Got the issue fixed, so I'm providing an answer for others who may come across this.
The issue was with the parse_args method of optparse, which treats an argument starting with a dash, such as -5, as an option. I've read in a number of places that though optparse is deprecated and argparse is recommended instead, django recommends using optparse since that is what it uses. Long story short, the link at link suggested a few alternatives, and using create_raw_data 1 -- -5 works as expected. So I did get a workaround. Thanks.
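The workaround can be reproduced with optparse alone, outside Django. A minimal sketch: `parse_command_line` is a hypothetical stand-in for what the management command machinery does, and only the --save option from the question is declared.

```python
from optparse import OptionParser


def parse_command_line(argv):
    """Parse argv the way an optparse-based command would; returns (options, args)."""
    parser = OptionParser(usage="create_raw_data owner_id temperature [--save]")
    parser.add_option("--save", action="store_true", default=False)
    return parser.parse_args(argv)


# "--" ends option parsing, so "-5" is kept as a positional argument.
options, args = parse_command_line(["1", "--", "-5"])
owner_id, temperature = int(args[0]), int(args[1])
```

Without the "--", optparse reports "no such option: -5" and exits before the command body runs, which matches the observation that print(args) was never reached. Note that anything after "--" is positional, so --save has to come before it.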