I can run python using Beautiful Soup and Mechanized, but for some reason when I try to use Spray-Scraper it just doesn't work. Here's an example of what happens when I attempt to test the scraper with a tutorial:
Project name & BOT name = "tutorial"
The following scripts are the items.py and settings.py that I used.
items.py
import scrapy
class DmozSpider(scrapy.Spider):
name = "dmoz"
allowed_domains = ["dmoz.org"]
start_urls = [
"http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
"http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
]
def parse(self, response):
filename = response.url.split("/")[-2]
with open(filename, 'wb') as f:
f.write(response.body)
settings.py
BOT_NAME = 'tutorial'
SPIDER_MODULES = ['tutorial.spiders']
NEWSPIDER_MODULE = 'tutorial.spiders'
CMD
C:\Users\Turbo>scrapy startproject tutorial
New Scrapy project 'tutorial' created in:
C:\Users\Turbo\tutorial
You can start your first spider with:
cd tutorial
scrapy genspider example example.com
C:\Users\Turbo>cd tutorial
C:\Users\Turbo\tutorial>scrapy crawl dmoz
Traceback (most recent call last):
File "C:\Python27\Scripts\scrapy-script.py", line 9, in <module>
load_entry_point('scrapy==0.24.4', 'console_scripts', 'scrapy')()
File "C:\Python27\lib\site-packages\scrapy-0.24.4-py2.7.egg\scrapy\cmdline.py"
, line 143, in execute
_run_print_help(parser, _run_command, cmd, args, opts)
File "C:\Python27\lib\site-packages\scrapy-0.24.4-py2.7.egg\scrapy\cmdline.py"
, line 89, in _run_print_help
func(*a, **kw)
File "C:\Python27\lib\site-packages\scrapy-0.24.4-py2.7.egg\scrapy\cmdline.py"
, line 150, in _run_command
cmd.run(args, opts)
File "C:\Python27\lib\site-packages\scrapy-0.24.4-py2.7.egg\scrapy\commands\cr
awl.py", line 58, in run
spider = crawler.spiders.create(spname, **opts.spargs)
File "C:\Python27\lib\site-packages\scrapy-0.24.4-py2.7.egg\scrapy\spidermanag
er.py", line 44, in create
raise KeyError("Spider not found: %s" % spider_name)
KeyError: 'Spider not found: dmoz'
The problem is that you are putting your spider into the items.py.
Instead, create a package spiders, inside it create a dmoz.py and put your spider into it.
See more at Our first Spider paragraph of the tutorial.
Related
I have built a horse human detector using keras CNN on Google colab the model worked and loaded perfectly on colab. Now I am building a flask application while loading he .h5 model file I was getting error
TypeError: __init__() got an unexpected keyword argument 'ragged'
I reinstall keras 2.3.1 using pip and now I am getting a library error
NameError: name 'six' is not defined
my App.py
#Import necessary libraries
from flask import Flask, render_template, request
import numpy as np
import os
from keras.preprocessing.image import load_img
from keras.preprocessing.image import img_to_array
from keras.models import load_model
#load model
model = load_model("predictor.h5" )
print("## model loaded")
def pred_human_horse(model , horse_or_human):
test_image = load_img(horse_or_human , target_size=(150,150)) #resize
print("## Got Image for predicton")
test_image = img_to_array(test_image)/255 #numpy array between 0-1
test_image = np.expand_dims(test_image,axis=0) #4 dimension
result= model.predict(test_image).round(3) #rounding off
pred =np.argmax(result)
print("## Raw results = ",result)
print("## class = ", pred)
if pred==0:
return "Horse"
else:
return "Human"
# Crate flask app
app = Flask(__name__)
#app.route("/",methods=["GET","POST"])
def home():
return render_template("index.html")
#app.route("/predict",methods=["GET","POST"])
def predict():
if request.method=="POST":
#get input image file
file = request.files["image"]
filename= file.filename
print("## File recieved",filename)
#save the file
file_path= os.path.join("static/user_uploaded",filename)
file.save(file_path)
print("## Prediction...")
pred=pred_human_horse(horse_or_human=file_path )
return render_template("predict.html" ,pred_output= pred , user_image=file_path )
if __name__=="__main__":
app.run(threaded=False)
Error I am getting
runfile('F:/INTERNSHIP/Crisp-Metric-MAY21/Human-horse-prediction/app.py', wdir='F:/INTERNSHIP/Crisp-Metric-MAY21/Human-horse-prediction')
Traceback (most recent call last):
File "<ipython-input-26-df590f092cb6>", line 1, in <module>
runfile('F:/INTERNSHIP/Crisp-Metric-MAY21/Human-horse-prediction/app.py', wdir='F:/INTERNSHIP/Crisp-Metric-MAY21/Human-horse-prediction')
File "C:\Users\DANIA NIAZI\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 827, in runfile
execfile(filename, namespace)
File "C:\Users\DANIA NIAZI\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 110, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "F:/INTERNSHIP/Crisp-Metric-MAY21/Human-horse-prediction/app.py", line 13, in <module>
model = load_model("predictor.h5" )
File "C:\Users\DANIA NIAZI\Anaconda3\lib\site-packages\keras\engine\saving.py", line 492, in load_wrapper
File "C:\Users\DANIA NIAZI\Anaconda3\lib\site-packages\keras\engine\saving.py", line 582, in load_model
File "C:\Users\DANIA NIAZI\Anaconda3\lib\site-packages\keras\utils\io_utils.py", line 211, in is_supported_type
NameError: name 'six' is not defined
Maybe you should try installing the six package which will be installed when installing Django. Anyway you can install it using:
pip install six
I'm running my app locally and I'm currently having an IOError during my file creation from the database. I am using Django 1.10, MongoDB as my database, and Celery 4.0.2 for my background tasks. The problem occurs in the tasks.py since that is where I access the db then store it in my django subfolder 'analysis_samples'.
Here is the traceback:
[2017-04-15 15:31:08,798: ERROR/PoolWorker-2] Task tasks.process_sample_input[0619194e-4300-4a1d-91b0-20766e048c4a] raised unexpected: IOError(2, 'No such file or directory')
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/celery/app/trace.py", line 367, in trace_task
R = retval = fun(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/celery/app/trace.py", line 622, in __protected_call__
return self.run(*args, **kwargs)
File "/home/user/django_apps/myapp/analysis/tasks.py", line 218, in process_sample_input
with open(sample_file_path, "w+") as f:
IOError: [Errno 2] No such file or directory: u'/home/user/django_apps/myapp/myapp/analysis_samples/58f1cc3c45015d127c3d68c1'
And here is the snippet of tasks.py:
from django.core.files import File
sys.path.append(settings.ANALYSIS_SAMPLES)
import base64
import os, sys
#shared_task(name='tasks.process_sample_input')
def process_sample_input(instance_id):
instance = Sample.objects.get(pk=instance_id)
#many code here..
try:
conn=pymongo.MongoClient(settings.MONGO_HOST, settings.MONGO_PORT)
db = conn.thugfs #connect to GridFS db of thug
thugfs_db = GridFS(db)
except pymongo.errors.ConnectionFailure, e:
logger.error("Could not connect to ThugFS MongoDB: %s" % e)
sample_file_folder = settings.ANALYSIS_SAMPLES
for sample_fs_id in sample_fs_ids:
sample_file = thugfs_db.get(ObjectId(sample_fs_id)).read()
sample_file = base64.b64decode(sample_file) #decode file from database
sample_file_path = os.path.join(sample_file_folder, sample_fs_id)
with open(sample_file_path, "w+") as f:
fileOut = File(f)
fileOut.write(sample_file)
settings.py:
ANALYSIS_SAMPLES = os.path.join(BASE_DIR, 'myapp/analysis_samples')
Can anyone see the point that caused the error? Any help will be appreciated.
I have a sample BDD scenario in Python Behave. When i run the feature I get the error:
ImportError: No module named features.steps.pages.home_page
I am not sure why it is complaining. home_page.py is in the pages folder, pages is in the steps folder and steps folder is in the features folder.
In pages folder I have an init.py file.
Why is it complaining it cannot find home_page.py?
My code is: features\steps.py
from behave import *
#from features.steps.pages import home_page
from features.steps.pages.home_page import HomePage
#from features.steps.pages import search_page
from features.steps.pages.search_page import SearchPage
from features.steps.pages import home_page
#Given ('we are on the homepage')
def step(context):
context.browser.get('http://www.test.com')
#When ('we enter "{product}" in the search field')
def step(context, product):
#search_field = context.browser.find_element(By.XPATH, 'id("twotabsearchtextbox")')
#search_field.send_keys(product)
home_page = HomePage(context)
home_page.enter_product_in_search_field(product, context)
#When ('And we click the search button')
def step(context):
#search_button = context.browser.find_element(By.XPATH, './/*[#id="nav-search"]/form/div[2]/div/input')
searchPage_results = home_page.click_search_button(context)
#search_button.click()
#Then ('the list of products are displayed')
def step(context):
context.searchPage_results.search_products_results(context)
#wait = WebDriverWait(context.browser, 60)
#divs = wait.until(EC.presence_of_all_elements_located((By.XPATH, '//div/a/h2')))
#for i in divs:
#div2 = divs + '/a/h2'
#print divs.get_attribute('value')
#print divs
#print i.text
#print "i"
# divs
features\steps\pages\home_page.py
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from search_page import SearchPage
class HomePage(object):
def __init__(self, context):
context = context
def enter_product_in_search_field(self, product, context):
search_field = context.browser.find_element(By.XPATH, 'id("twotabsearchtextbox")')
search_field.send_keys(product)
return self
def click_search_button(self, context):
search_button = context.find_element(By.XPATH, './/*[#id="nav-search"]/form/div[2]/div/input').click()
return SearchPage(context)
features\test_feature.feature
Feature: testing product
Scenario Outline: visit test and search for product
Given we are on the test homepage
When we enter "<product>" in the search field
And we click the search button
Then the list of products are displayed
Examples: By product
| Forumla One |
| PS4 |
| Headphones |
My directory structure is:
E:features\test_feature.feature
E:features\init.py
E:features\pages\init.py
E:features\pages\home_page.py
E:features\pages\search_page.py
The full error is:
E:\RL Fusion\projects\BDD\Python BDD\PythonBDD\Selenium Sample\features>behave test_feature.feature
Exception ImportError: No module named features.steps.pages.home_page
Traceback (most recent call last):
File "C:\Python27\lib\runpy.py", line 162, in _run_module_as_main
"__main__", fname, loader, pkg_name)
File "C:\Python27\lib\runpy.py", line 72, in _run_code
exec code in run_globals
File "C:\Python27\scripts\behave.exe\__main__.py", line 9, in <module>
File "C:\Python27\lib\site-packages\behave\__main__.py", line 109, in main
failed = runner.run()
File "C:\Python27\lib\site-packages\behave\runner.py", line 672, in run
return self.run_with_paths()
File "C:\Python27\lib\site-packages\behave\runner.py", line 678, in run_with_paths
self.load_step_definitions()
File "C:\Python27\lib\site-packages\behave\runner.py", line 658, in load_step_definitions
exec_file(os.path.join(path, name), step_module_globals)
File "C:\Python27\lib\site-packages\behave\runner.py", line 304, in exec_file
exec(code, globals, locals)
File "steps\amazon_steps.py", line 6, in <module>
from features.steps.pages.home_page import HomePage
ImportError: No module named features.steps.pages.home_page
How can I resolve this issue?
Thanks, Riaz
It looks like you are not importing your modules correctly. To turn a directory in to a module, change all your init.py files to __init__.py.
Then when you are importing in features/steps.py you can use:
from pages.home_page import HomePage
Whenever I try to construct a string based on self.live_server_url, I get python TypeError messages. For example, I've tried the following string constructions (form 1 & 2 below), but I experience the same TypeError. My desired string is the Live Server URL with "/lists" appended. NOTE: the actual test does succeed to create a server and I can manually access the server, and more specifically, I can manually access the exact URL that I'm trying to build programmatically (e.g. 'http://localhost:8081/lists').
TypeErrors occur with these string constructions.
# FORM 1
lists_live_server_url = '%s%s' % (self.live_server_url, '/lists')
# FORM 2
lists_live_server_url = '{0}{1}'.format(self.live_server_url, '/lists')
self.browser.get(lists_live_server_url)
There is no python error with this form (nothing appended to string), albeit my test fails (as I would expect since it isn't accessing /lists).
self.browser.get(self.live_server_url)
Here is the python error that I'm getting.
/usr/local/Cellar/python3/3.4.2_1/Frameworks/Python.framework/Versions/3.4/bin/python3.4 /Applications/PyCharm.app/Contents/helpers/pycharm/django_test_manage.py test functional_tests.lists_tests.LiveNewVisitorTest.test_can_start_a_list_and_retrieve_it_later /Users/myusername/PycharmProjects/mysite_proj
Testing started at 11:55 AM ...
Creating test database for alias 'default'...
Traceback (most recent call last):
File "/usr/local/Cellar/python3/3.4.2_1/Frameworks/Python.framework/Versions/3.4/lib/python3.4/wsgiref/handlers.py", line 137, in run
self.result = application(self.environ, self.start_response)
File "/usr/local/lib/python3.4/site-packages/django/test/testcases.py", line 1104, in __call__
return super(FSFilesHandler, self).__call__(environ, start_response)
File "/usr/local/lib/python3.4/site-packages/django/core/handlers/wsgi.py", line 189, in __call__
response = self.get_response(request)
File "/usr/local/lib/python3.4/site-packages/django/test/testcases.py", line 1087, in get_response
return self.serve(request)
File "/usr/local/lib/python3.4/site-packages/django/test/testcases.py", line 1099, in serve
return serve(request, final_rel_path, document_root=self.get_base_dir())
File "/usr/local/lib/python3.4/site-packages/django/views/static.py", line 54, in serve
fullpath = os.path.join(document_root, newpath)
File "/usr/local/Cellar/python3/3.4.2_1/Frameworks/Python.framework/Versions/3.4/lib/python3.4/posixpath.py", line 82, in join
path += b
TypeError: unsupported operand type(s) for +=: 'NoneType' and 'str'
Am I unknowingly attempting to modify the live_server_url, which is leading to these TypeErrors? How could I programmatically build a string of live_server_url + "/lists"?
Here is the test that I am attempting...
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from django.test import LiveServerTestCase
class LiveNewVisitorTest(LiveServerTestCase):
def setUp(self):
self.browser = webdriver.Chrome()
self.browser.implicitly_wait(3)
def tearDown(self):
self.browser.close()
def test_can_start_a_list_and_retrieve_it_later(self):
#self.browser.get('http://localhost:8000/lists')
#self.browser.get('http://www.google.com')
#lists_live_server_url = '%s%s' % (self.live_server_url, '/lists')
#lists_live_server_url = '{0}{1}'.format(self.live_server_url, '/lists')
lists_live_server_url = self.live_server_url
self.browser.get(lists_live_server_url)
self.assertIn('To-Do', self.browser.title)
header_text = self.browser.find_element_by_tag_name('h1').text
self.assertIn('To-Do', header_text)
See this discussion on Reddit featuring the same error Traceback.
Basically, this is not a problem with anything within the Selenium tests but rather with your project's static file configuration.
From your question, I believe the key line within the Traceback is:
File "/usr/local/lib/python3.4/site-packages/django/views/static.py", line 54, in serve
fullpath = os.path.join(document_root, newpath)
This line indicates that an unsuccessful os.path.join is being attempted within django.views.static.
Set STATIC_ROOT in your project's settings.pyfile and you should be good.
Use StaticLiveServerTestCase instead may help
On my system, I've both python3 and python 2.7, scrapy only suports python2.7 but by debault my libraries are linked to python 3.4.
I am trying to run a basic example that comes with scrapy documentation that is:
#!/usr/bin/python2.7
import scrapy
class StackOverflowSpider(scrapy.Spider):
name = 'stackoverflow'
start_urls = ['http://stackoverflow.com/questions?sort=votes']
def parse(self, response):
for href in response.css('.question-summary h3 a::attr(href)'):
full_url = response.urljoin(href.extract())
yield scrapy.Request(full_url, callback=self.parse_question)
def parse_question(self, response):
yield {
'title': response.css('h1 a::text').extract()[0],
'votes': response.css('.question .vote-count-post::text').extract()[0],
'body': response.css('.question .post-text').extract()[0],
'tags': response.css('.question .post-tag::text').extract(),
'link': response.url,
}
To run this code, is suggested for the command:
scrapy runspider stackoverflow_spider.py -o top-stackoverflow-questions.json
being stackoverflow_spider.py the above snippet.
The problem is that somehow this is calling python3,and since I am not explicitly calling a python version, not sure how to force the command to use python2.7 libs.
below is the error I get:
Traceback (most recent call last):
File "/usr/local/bin/scrapy", line 9, in <module>
load_entry_point('Scrapy==1.0.5', 'console_scripts', 'scrapy')()
File "/usr/local/lib/python3.4/dist-packages/scrapy/cmdline.py", line 122, in execute
cmds = _get_commands_dict(settings, inproject)
File "/usr/local/lib/python3.4/dist-packages/scrapy/cmdline.py", line 46, in _get_commands_dict
cmds = _get_commands_from_module('scrapy.commands', inproject)
File "/usr/local/lib/python3.4/dist-packages/scrapy/cmdline.py", line 29, in _get_commands_from_module
for cmd in _iter_command_classes(module):
File "/usr/local/lib/python3.4/dist-packages/scrapy/cmdline.py", line 21, in _iter_command_classes
for obj in vars(module).itervalues():
AttributeError: 'dict' object has no attribute 'itervalues'
PS: I have installed scrapy under python 2.7