Scrapy-Scraper Does Not Run - python-2.7

I can run Python scripts that use Beautiful Soup and Mechanize, but for some reason Scrapy just doesn't work. Here's what happens when I try to test it with the tutorial:
Project name & BOT name = "tutorial"
The following scripts are the items.py and settings.py that I used.
items.py
import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        with open(filename, 'wb') as f:
            f.write(response.body)
settings.py
BOT_NAME = 'tutorial'
SPIDER_MODULES = ['tutorial.spiders']
NEWSPIDER_MODULE = 'tutorial.spiders'
CMD
C:\Users\Turbo>scrapy startproject tutorial
New Scrapy project 'tutorial' created in:
C:\Users\Turbo\tutorial
You can start your first spider with:
cd tutorial
scrapy genspider example example.com
C:\Users\Turbo>cd tutorial
C:\Users\Turbo\tutorial>scrapy crawl dmoz
Traceback (most recent call last):
  File "C:\Python27\Scripts\scrapy-script.py", line 9, in <module>
    load_entry_point('scrapy==0.24.4', 'console_scripts', 'scrapy')()
  File "C:\Python27\lib\site-packages\scrapy-0.24.4-py2.7.egg\scrapy\cmdline.py", line 143, in execute
    _run_print_help(parser, _run_command, cmd, args, opts)
  File "C:\Python27\lib\site-packages\scrapy-0.24.4-py2.7.egg\scrapy\cmdline.py", line 89, in _run_print_help
    func(*a, **kw)
  File "C:\Python27\lib\site-packages\scrapy-0.24.4-py2.7.egg\scrapy\cmdline.py", line 150, in _run_command
    cmd.run(args, opts)
  File "C:\Python27\lib\site-packages\scrapy-0.24.4-py2.7.egg\scrapy\commands\crawl.py", line 58, in run
    spider = crawler.spiders.create(spname, **opts.spargs)
  File "C:\Python27\lib\site-packages\scrapy-0.24.4-py2.7.egg\scrapy\spidermanager.py", line 44, in create
    raise KeyError("Spider not found: %s" % spider_name)
KeyError: 'Spider not found: dmoz'

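The problem is that you put your spider into items.py, which is not where Scrapy looks for spiders. With SPIDER_MODULES = ['tutorial.spiders'], Scrapy only searches the tutorial.spiders package.
Instead, put the spider into a dmoz.py file inside the spiders package (create the package if it does not exist yet).
See the "Our first Spider" section of the tutorial for more.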
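To check the layout described above, a small script can report which of the expected files are missing. This is only a sketch: the helper name missing_spider_files is made up for illustration, and it assumes the default layout produced by scrapy startproject tutorial, where the project package lives in a tutorial directory inside the project root.

```python
import os

def missing_spider_files(project_root, package="tutorial", spider_file="dmoz.py"):
    """Return the expected spider-package files that do not exist yet."""
    spiders_dir = os.path.join(project_root, package, "spiders")
    expected = [
        # __init__.py is what makes the spiders directory a package.
        os.path.join(spiders_dir, "__init__.py"),
        # The module that should contain the DmozSpider class.
        os.path.join(spiders_dir, spider_file),
    ]
    return [path for path in expected if not os.path.isfile(path)]
```

For the project in the question, calling missing_spider_files(r"C:\Users\Turbo\tutorial") would presumably list the dmoz.py path until the spider is moved out of items.py.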

Related

Error while loading .h5 model in Flask using keras

I built a horse/human detector using a Keras CNN on Google Colab, and the model worked and loaded perfectly there. Now I am building a Flask application, and while loading the .h5 model file I got this error:
TypeError: __init__() got an unexpected keyword argument 'ragged'
I reinstalled Keras 2.3.1 using pip, and now I am getting a different error:
NameError: name 'six' is not defined
My app.py:
# Import necessary libraries
from flask import Flask, render_template, request
import numpy as np
import os
from keras.preprocessing.image import load_img
from keras.preprocessing.image import img_to_array
from keras.models import load_model

# Load the model
model = load_model("predictor.h5")
print("## model loaded")

def pred_human_horse(model, horse_or_human):
    test_image = load_img(horse_or_human, target_size=(150, 150))  # resize
    print("## Got image for prediction")
    test_image = img_to_array(test_image) / 255  # numpy array between 0 and 1
    test_image = np.expand_dims(test_image, axis=0)  # add a batch dimension (4D)
    result = model.predict(test_image).round(3)  # rounding off
    pred = np.argmax(result)
    print("## Raw results = ", result)
    print("## class = ", pred)
    if pred == 0:
        return "Horse"
    else:
        return "Human"

# Create the Flask app
app = Flask(__name__)

@app.route("/", methods=["GET", "POST"])
def home():
    return render_template("index.html")

@app.route("/predict", methods=["GET", "POST"])
def predict():
    if request.method == "POST":
        # Get the input image file
        file = request.files["image"]
        filename = file.filename
        print("## File received", filename)
        # Save the file
        file_path = os.path.join("static/user_uploaded", filename)
        file.save(file_path)
        print("## Prediction...")
        pred = pred_human_horse(model, horse_or_human=file_path)
        return render_template("predict.html", pred_output=pred, user_image=file_path)

if __name__ == "__main__":
    app.run(threaded=False)
Error I am getting
runfile('F:/INTERNSHIP/Crisp-Metric-MAY21/Human-horse-prediction/app.py', wdir='F:/INTERNSHIP/Crisp-Metric-MAY21/Human-horse-prediction')
Traceback (most recent call last):
File "<ipython-input-26-df590f092cb6>", line 1, in <module>
runfile('F:/INTERNSHIP/Crisp-Metric-MAY21/Human-horse-prediction/app.py', wdir='F:/INTERNSHIP/Crisp-Metric-MAY21/Human-horse-prediction')
File "C:\Users\DANIA NIAZI\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 827, in runfile
execfile(filename, namespace)
File "C:\Users\DANIA NIAZI\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 110, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "F:/INTERNSHIP/Crisp-Metric-MAY21/Human-horse-prediction/app.py", line 13, in <module>
model = load_model("predictor.h5" )
File "C:\Users\DANIA NIAZI\Anaconda3\lib\site-packages\keras\engine\saving.py", line 492, in load_wrapper
File "C:\Users\DANIA NIAZI\Anaconda3\lib\site-packages\keras\engine\saving.py", line 582, in load_model
File "C:\Users\DANIA NIAZI\Anaconda3\lib\site-packages\keras\utils\io_utils.py", line 211, in is_supported_type
NameError: name 'six' is not defined
The last frame of the traceback shows Keras failing to find the name six, so try installing the six package in the same environment. You can install it using:
pip install six
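After installing, it may be worth confirming that six is visible to the same interpreter that runs app.py (an Anaconda/Spyder Python 3 install, judging by the traceback), since pip can install into a different environment. A minimal sketch, assuming Python 3; the helper name is_installed is made up for illustration:

```python
import importlib.util
import sys

def is_installed(module_name):
    """Return True if the current interpreter can locate module_name."""
    return importlib.util.find_spec(module_name) is not None

if __name__ == "__main__":
    # Show which interpreter is running, then whether it can see six.
    print(sys.executable)
    print("six installed:", is_installed("six"))
```

If this prints False from inside Spyder, pip has probably installed the package for a different Python than the one Spyder uses.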

File writing in Django keeps having IOError

I'm running my app locally and I'm currently having an IOError during my file creation from the database. I am using Django 1.10, MongoDB as my database, and Celery 4.0.2 for my background tasks. The problem occurs in the tasks.py since that is where I access the db then store it in my django subfolder 'analysis_samples'.
Here is the traceback:
[2017-04-15 15:31:08,798: ERROR/PoolWorker-2] Task tasks.process_sample_input[0619194e-4300-4a1d-91b0-20766e048c4a] raised unexpected: IOError(2, 'No such file or directory')
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/celery/app/trace.py", line 367, in trace_task
R = retval = fun(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/celery/app/trace.py", line 622, in __protected_call__
return self.run(*args, **kwargs)
File "/home/user/django_apps/myapp/analysis/tasks.py", line 218, in process_sample_input
with open(sample_file_path, "w+") as f:
IOError: [Errno 2] No such file or directory: u'/home/user/django_apps/myapp/myapp/analysis_samples/58f1cc3c45015d127c3d68c1'
And here is the snippet of tasks.py:
import base64
import os, sys
from django.core.files import File
sys.path.append(settings.ANALYSIS_SAMPLES)

@shared_task(name='tasks.process_sample_input')
def process_sample_input(instance_id):
    instance = Sample.objects.get(pk=instance_id)
    # many code here..
    try:
        conn = pymongo.MongoClient(settings.MONGO_HOST, settings.MONGO_PORT)
        db = conn.thugfs  # connect to GridFS db of thug
        thugfs_db = GridFS(db)
    except pymongo.errors.ConnectionFailure, e:
        logger.error("Could not connect to ThugFS MongoDB: %s" % e)
    sample_file_folder = settings.ANALYSIS_SAMPLES
    for sample_fs_id in sample_fs_ids:
        sample_file = thugfs_db.get(ObjectId(sample_fs_id)).read()
        sample_file = base64.b64decode(sample_file)  # decode file from database
        sample_file_path = os.path.join(sample_file_folder, sample_fs_id)
        with open(sample_file_path, "w+") as f:
            fileOut = File(f)
            fileOut.write(sample_file)
settings.py:
ANALYSIS_SAMPLES = os.path.join(BASE_DIR, 'myapp/analysis_samples')
Can anyone see the point that caused the error? Any help will be appreciated.
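One common cause of Errno 2 when opening a file for writing is that the parent directory does not exist: open() creates the file but not missing directories. The path in the traceback contains myapp/myapp/analysis_samples, so it may be worth checking that this directory really exists. A minimal sketch that creates the folder before writing; ensure_dir and write_sample are hypothetical helper names, not part of the original tasks.py:

```python
import errno
import os

def ensure_dir(path):
    """Create path (and missing parents) if it does not already exist."""
    try:
        os.makedirs(path)
    except OSError as exc:
        # Ignore "already exists" only when the existing entry is a directory.
        if exc.errno != errno.EEXIST or not os.path.isdir(path):
            raise

def write_sample(folder, name, data):
    """Write the bytes in data to folder/name, creating folder first."""
    ensure_dir(folder)
    file_path = os.path.join(folder, name)
    # Binary mode, because the decoded sample is raw bytes.
    with open(file_path, "wb") as f:
        f.write(data)
    return file_path
```

The try/except form is used instead of exist_ok=True so that the sketch also runs on Python 2.7, which the traceback shows is in use.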

Behave ImportError: No module named features.steps.pages.home_page

I have a sample BDD scenario in Python Behave. When I run the feature I get the error:
ImportError: No module named features.steps.pages.home_page
I am not sure why it is complaining. home_page.py is in the pages folder, pages is in the steps folder and steps folder is in the features folder.
In pages folder I have an init.py file.
Why is it complaining it cannot find home_page.py?
My code is: features\steps.py
from behave import *
#from features.steps.pages import home_page
from features.steps.pages.home_page import HomePage
#from features.steps.pages import search_page
from features.steps.pages.search_page import SearchPage
from features.steps.pages import home_page

@Given ('we are on the homepage')
def step(context):
    context.browser.get('http://www.test.com')

@When ('we enter "{product}" in the search field')
def step(context, product):
    #search_field = context.browser.find_element(By.XPATH, 'id("twotabsearchtextbox")')
    #search_field.send_keys(product)
    home_page = HomePage(context)
    home_page.enter_product_in_search_field(product, context)

@When ('And we click the search button')
def step(context):
    #search_button = context.browser.find_element(By.XPATH, './/*[@id="nav-search"]/form/div[2]/div/input')
    searchPage_results = home_page.click_search_button(context)
    #search_button.click()

@Then ('the list of products are displayed')
def step(context):
    context.searchPage_results.search_products_results(context)
    #wait = WebDriverWait(context.browser, 60)
    #divs = wait.until(EC.presence_of_all_elements_located((By.XPATH, '//div/a/h2')))
    #for i in divs:
    #div2 = divs + '/a/h2'
    #print divs.get_attribute('value')
    #print divs
    #print i.text
    #print "i"
    # divs
features\steps\pages\home_page.py
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from search_page import SearchPage

class HomePage(object):
    def __init__(self, context):
        context = context

    def enter_product_in_search_field(self, product, context):
        search_field = context.browser.find_element(By.XPATH, 'id("twotabsearchtextbox")')
        search_field.send_keys(product)
        return self

    def click_search_button(self, context):
        search_button = context.find_element(By.XPATH, './/*[@id="nav-search"]/form/div[2]/div/input').click()
        return SearchPage(context)
features\test_feature.feature
Feature: testing product
  Scenario Outline: visit test and search for product
    Given we are on the test homepage
    When we enter "<product>" in the search field
    And we click the search button
    Then the list of products are displayed
    Examples: By product
      | Forumla One |
      | PS4 |
      | Headphones |
My directory structure is:
E:features\test_feature.feature
E:features\init.py
E:features\pages\init.py
E:features\pages\home_page.py
E:features\pages\search_page.py
The full error is:
E:\RL Fusion\projects\BDD\Python BDD\PythonBDD\Selenium Sample\features>behave test_feature.feature
Exception ImportError: No module named features.steps.pages.home_page
Traceback (most recent call last):
File "C:\Python27\lib\runpy.py", line 162, in _run_module_as_main
"__main__", fname, loader, pkg_name)
File "C:\Python27\lib\runpy.py", line 72, in _run_code
exec code in run_globals
File "C:\Python27\scripts\behave.exe\__main__.py", line 9, in <module>
File "C:\Python27\lib\site-packages\behave\__main__.py", line 109, in main
failed = runner.run()
File "C:\Python27\lib\site-packages\behave\runner.py", line 672, in run
return self.run_with_paths()
File "C:\Python27\lib\site-packages\behave\runner.py", line 678, in run_with_paths
self.load_step_definitions()
File "C:\Python27\lib\site-packages\behave\runner.py", line 658, in load_step_definitions
exec_file(os.path.join(path, name), step_module_globals)
File "C:\Python27\lib\site-packages\behave\runner.py", line 304, in exec_file
exec(code, globals, locals)
File "steps\amazon_steps.py", line 6, in <module>
from features.steps.pages.home_page import HomePage
ImportError: No module named features.steps.pages.home_page
How can I resolve this issue?
Thanks, Riaz
It looks like you are not importing your modules correctly. To turn a directory into a package, rename all your init.py files to __init__.py.
Then, when importing in features/steps.py, you can use:
from pages.home_page import HomePage
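The effect of __init__.py can be tried in isolation. The sketch below builds a throwaway pages package in a temporary directory and imports HomePage from it, assuming the directory that contains pages is on sys.path (which the import above relies on). The helper names build_demo_package and import_home_page are made up for illustration. On Python 2.7, as used in the question, a directory without __init__.py is not importable as a package; Python 3.3+ is more lenient because of namespace packages.

```python
import importlib
import os
import sys
import tempfile

def build_demo_package(root):
    """Create root/pages/__init__.py and root/pages/home_page.py."""
    pages_dir = os.path.join(root, "pages")
    os.makedirs(pages_dir)
    # An empty __init__.py marks the directory as a regular package.
    open(os.path.join(pages_dir, "__init__.py"), "w").close()
    with open(os.path.join(pages_dir, "home_page.py"), "w") as f:
        f.write("class HomePage(object):\n    pass\n")

def import_home_page(root):
    """Import HomePage with root temporarily placed on sys.path."""
    sys.path.insert(0, root)
    try:
        module = importlib.import_module("pages.home_page")
        return module.HomePage
    finally:
        sys.path.remove(root)

if __name__ == "__main__":
    demo_root = tempfile.mkdtemp()
    build_demo_package(demo_root)
    print(import_home_page(demo_root))
```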

Log warning from Selenium on Django [duplicate]

Whenever I try to construct a string based on self.live_server_url, I get Python TypeError messages. For example, I've tried the following string constructions (forms 1 and 2 below), and both produce the same TypeError. My desired string is the live server URL with "/lists" appended. NOTE: the test does succeed in creating a server, and I can manually access the exact URL that I'm trying to build programmatically (e.g. 'http://localhost:8081/lists').
TypeErrors occur with these string constructions.
# FORM 1
lists_live_server_url = '%s%s' % (self.live_server_url, '/lists')
# FORM 2
lists_live_server_url = '{0}{1}'.format(self.live_server_url, '/lists')
self.browser.get(lists_live_server_url)
There is no python error with this form (nothing appended to string), albeit my test fails (as I would expect since it isn't accessing /lists).
self.browser.get(self.live_server_url)
Here is the python error that I'm getting.
/usr/local/Cellar/python3/3.4.2_1/Frameworks/Python.framework/Versions/3.4/bin/python3.4 /Applications/PyCharm.app/Contents/helpers/pycharm/django_test_manage.py test functional_tests.lists_tests.LiveNewVisitorTest.test_can_start_a_list_and_retrieve_it_later /Users/myusername/PycharmProjects/mysite_proj
Testing started at 11:55 AM ...
Creating test database for alias 'default'...
Traceback (most recent call last):
File "/usr/local/Cellar/python3/3.4.2_1/Frameworks/Python.framework/Versions/3.4/lib/python3.4/wsgiref/handlers.py", line 137, in run
self.result = application(self.environ, self.start_response)
File "/usr/local/lib/python3.4/site-packages/django/test/testcases.py", line 1104, in __call__
return super(FSFilesHandler, self).__call__(environ, start_response)
File "/usr/local/lib/python3.4/site-packages/django/core/handlers/wsgi.py", line 189, in __call__
response = self.get_response(request)
File "/usr/local/lib/python3.4/site-packages/django/test/testcases.py", line 1087, in get_response
return self.serve(request)
File "/usr/local/lib/python3.4/site-packages/django/test/testcases.py", line 1099, in serve
return serve(request, final_rel_path, document_root=self.get_base_dir())
File "/usr/local/lib/python3.4/site-packages/django/views/static.py", line 54, in serve
fullpath = os.path.join(document_root, newpath)
File "/usr/local/Cellar/python3/3.4.2_1/Frameworks/Python.framework/Versions/3.4/lib/python3.4/posixpath.py", line 82, in join
path += b
TypeError: unsupported operand type(s) for +=: 'NoneType' and 'str'
Am I unknowingly attempting to modify the live_server_url, which is leading to these TypeErrors? How could I programmatically build a string of live_server_url + "/lists"?
Here is the test that I am attempting...
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from django.test import LiveServerTestCase

class LiveNewVisitorTest(LiveServerTestCase):
    def setUp(self):
        self.browser = webdriver.Chrome()
        self.browser.implicitly_wait(3)

    def tearDown(self):
        self.browser.close()

    def test_can_start_a_list_and_retrieve_it_later(self):
        #self.browser.get('http://localhost:8000/lists')
        #self.browser.get('http://www.google.com')
        #lists_live_server_url = '%s%s' % (self.live_server_url, '/lists')
        #lists_live_server_url = '{0}{1}'.format(self.live_server_url, '/lists')
        lists_live_server_url = self.live_server_url
        self.browser.get(lists_live_server_url)
        self.assertIn('To-Do', self.browser.title)
        header_text = self.browser.find_element_by_tag_name('h1').text
        self.assertIn('To-Do', header_text)
See this discussion on Reddit featuring the same error Traceback.
Basically, this is not a problem with anything within the Selenium tests but rather with your project's static file configuration.
From your question, I believe the key line within the Traceback is:
File "/usr/local/lib/python3.4/site-packages/django/views/static.py", line 54, in serve
fullpath = os.path.join(document_root, newpath)
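This line indicates that os.path.join is failing within django.views.static: document_root is None (hence the 'NoneType' in the TypeError), most likely because STATIC_ROOT is not set.
Set STATIC_ROOT in your project's settings.py file and you should be good.
Using StaticLiveServerTestCase instead of LiveServerTestCase may also help.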
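The failure can be reproduced outside Django: joining a None root with a relative path raises a TypeError, while a real directory string works. A minimal sketch; build_full_path and describe_join are hypothetical names that only imitate the os.path.join call shown in the traceback, and the exact wording of the error message differs between Python versions.

```python
import os

def build_full_path(document_root, relative_path):
    """Join a static root and a request path, as a static file view would."""
    return os.path.join(document_root, relative_path)

def describe_join(document_root, relative_path):
    """Return the joined path, or the TypeError message if the join fails."""
    try:
        return build_full_path(document_root, relative_path)
    except TypeError as exc:
        return "TypeError: %s" % exc

if __name__ == "__main__":
    # An unset root (None) reproduces the kind of error in the traceback.
    print(describe_join(None, "lists"))
    # A configured root joins normally.
    print(describe_join("/var/www/static", "css/site.css"))
```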

Make/force Scrapy to use Python 2.7

On my system I have both Python 3 and Python 2.7. Scrapy only supports Python 2.7, but by default my libraries are linked to Python 3.4.
I am trying to run a basic example from the Scrapy documentation:
#!/usr/bin/python2.7
import scrapy

class StackOverflowSpider(scrapy.Spider):
    name = 'stackoverflow'
    start_urls = ['http://stackoverflow.com/questions?sort=votes']

    def parse(self, response):
        for href in response.css('.question-summary h3 a::attr(href)'):
            full_url = response.urljoin(href.extract())
            yield scrapy.Request(full_url, callback=self.parse_question)

    def parse_question(self, response):
        yield {
            'title': response.css('h1 a::text').extract()[0],
            'votes': response.css('.question .vote-count-post::text').extract()[0],
            'body': response.css('.question .post-text').extract()[0],
            'tags': response.css('.question .post-tag::text').extract(),
            'link': response.url,
        }
The suggested command to run this code is:
scrapy runspider stackoverflow_spider.py -o top-stackoverflow-questions.json
where stackoverflow_spider.py is the snippet above.
The problem is that this somehow calls Python 3, and since I am not explicitly calling a Python version, I am not sure how to force the command to use the Python 2.7 libraries.
Below is the error I get:
Traceback (most recent call last):
File "/usr/local/bin/scrapy", line 9, in <module>
load_entry_point('Scrapy==1.0.5', 'console_scripts', 'scrapy')()
File "/usr/local/lib/python3.4/dist-packages/scrapy/cmdline.py", line 122, in execute
cmds = _get_commands_dict(settings, inproject)
File "/usr/local/lib/python3.4/dist-packages/scrapy/cmdline.py", line 46, in _get_commands_dict
cmds = _get_commands_from_module('scrapy.commands', inproject)
File "/usr/local/lib/python3.4/dist-packages/scrapy/cmdline.py", line 29, in _get_commands_from_module
for cmd in _iter_command_classes(module):
File "/usr/local/lib/python3.4/dist-packages/scrapy/cmdline.py", line 21, in _iter_command_classes
for obj in vars(module).itervalues():
AttributeError: 'dict' object has no attribute 'itervalues'
PS: I have installed scrapy under python 2.7
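The traceback shows /usr/local/bin/scrapy importing from python3.4/dist-packages, which suggests that the launcher script itself is bound to Python 3, regardless of the shebang in the spider file. One way to check is to read the launcher's first line. A minimal sketch; the helper name read_shebang is made up for illustration:

```python
def read_shebang(script_path):
    """Return the interpreter named on a script's shebang line, or None."""
    with open(script_path, "rb") as f:
        first_line = f.readline().decode("utf-8", "replace").strip()
    if first_line.startswith("#!"):
        return first_line[2:].strip()
    return None

if __name__ == "__main__":
    # Path taken from the traceback in the question.
    print(read_shebang("/usr/local/bin/scrapy"))
```

If the result names python3, one possible workaround is to run Scrapy's command-line module with the desired interpreter, for example python2.7 -m scrapy.cmdline runspider stackoverflow_spider.py, assuming Scrapy is importable from that Python 2.7 installation.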