I'm new to programming and I'm trying to learn Scrapy using the Scrapy tutorial: http://doc.scrapy.org/en/latest/intro/tutorial.html
I ran the "scrapy crawl dmoz" command and got this error:
2015-07-14 16:11:02 [scrapy] INFO: Scrapy 1.0.1 started (bot: tutorial)
2015-07-14 16:11:02 [scrapy] INFO: Optional features available: ssl, http11
2015-07-14 16:11:02 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tutorial.spiders', 'SPIDER_MODULES': ['tutorial.spiders'], 'BOT_NAME': 'tutorial'}
2015-07-14 16:11:05 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
Unhandled error in Deferred:
2015-07-14 16:11:06 [twisted] CRITICAL: Unhandled error in Deferred:
2015-07-14 16:11:07 [twisted] CRITICAL:
I'm using Windows 7 and Python 2.7. Does anybody know what the problem is? How can I fix it?
EDIT: My spider file code is:
# This package will contain the spiders of your Scrapy project
#
# Please refer to the documentation for information on how to create and manage
# your spiders.
import scrapy
class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/computers/programming/languages/python/books/",
        "http://www.dmoz.org/computer/programming/languages/python/resources/"
    ]
    def parse(self, response):
        filename = response.url.split("/")[-2] + '.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
items.py code:
import scrapy
class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()
pip list:
bootstrap-admin (0.3.3)
cffi (1.1.2)
characteristic (14.3.0)
cryptography (0.9.3)
cssselect (0.9.1)
Django (1.7.7)
django-auth-ldap (1.2.4)
django-debug-toolbar (1.3.0)
django-mssql (1.6.2)
django-pyodbc (0.2.6)
django-pyodbc-azure (1.2.2)
django-redator (0.2.3)
django-reversion (1.8.5)
django-summernote (0.6.0)
django-windows-tools (0.1.1)
django-wysiwyg-redactor (0.4.3.2)
enum34 (1.0.4)
ez-setup (0.9)
flup (1.0.2)
idna (2.0)
ipaddress (1.0.13)
iso8601 (0.1.4)
logging (0.4.9.6)
lxml (3.4.4)
mechanize (0.2.5)
MySQL-python (1.2.4)
pbr (0.10.8)
Pillow (2.7.0)
pip (7.1.0)
pyasn1 (0.1.8)
pyasn1-modules (0.0.6)
pycparser (2.14)
pymongo (2.6)
pyodbc (3.0.7)
pyOpenSSL (0.15.1)
pypm (1.4.3)
python-ldap (2.4.18)
pythonselect (1.3)
pywin32 (218.3)
queuelib (1.2.2)
Scrapy (1.0.1)
selenium (2.44.0)
service-identity (14.0.0)
setuptools (18.0.1)
six (1.9.0)
sqlparse (0.1.15)
stevedore (1.3.0)
Twisted (15.2.1)
virtualenv (1.11.6)
virtualenv-clone (0.2.5)
virtualenvwrapper (4.3.2)
virtualenvwrapper-powershell (12.7.8)
w3lib (1.11.0)
xlrd (0.9.2)
zope.interface (4.1.2)
Thanks for your attention, and sorry for my poor English; it isn't my native language.
I'm just starting to learn Scrapy as well and ran into the same problem.
After struggling with it for an afternoon, I finally found that the cause was the pywin32 module, which had only been downloaded and not fully installed.
Try running the command below in cmd to finish the pywin32 installation, then run the crawl again:
python python27\scripts\pywin32_postinstall.py -install
I hope it will help!
The short answer is: you are missing pywin32!
The other answers are basically right, but not 100% correct. pywin32 is not installed with pip! You must download the installer package from here:
http://sourceforge.net/projects/pywin32/files/pywin32/
Make sure that you get the correct build, 32-bit or 64-bit. In my case I didn't realize I had the 32-bit version of Python installed on my 64-bit machine, and the installer failed with "Cannot find Python 2.7 installation in registry". I had to install the 32-bit version of pywin32. Once I did this, scrapy crawl site worked.
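If you are not sure which build of Python you have, a small script can tell you. This is only a sketch: the function name python_bitness is made up for illustration, and it relies on the standard library modules struct and platform.

```python
import platform
import struct


def python_bitness():
    """Return 32 or 64 depending on the pointer size of the running interpreter."""
    # struct.calcsize("P") is the size of a C pointer in bytes.
    return struct.calcsize("P") * 8


if __name__ == "__main__":
    print("This Python interpreter is %d-bit" % python_bitness())
    print("platform.architecture() reports: %s" % platform.architecture()[0])
```

Run it with the same python executable that you use for Scrapy, then pick the pywin32 installer that matches the number it reports.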
I don't see what you're doing with the items, since you're writing to a file, but the problem may be the imports. Try the code below. If that does not work, try pip install --upgrade pywin32 and pip install --upgrade Twisted, which should replace any corrupted files.
Also, I don't know if it's a Stack Overflow formatting problem, but your code had some misplaced indentation.
from scrapy.spiders import Spider
# Replace {Projectname} and {Itemclass} with your own project and item class names.
from {Projectname}.items import {Itemclass}
import scrapy
class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/computers/programming/languages/python/books/",
        "http://www.dmoz.org/computer/programming/languages/python/resources/"]
    def parse(self, response):
        filename = response.url.split("/")[-2] + '.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
Scrapy crashes with: ImportError: No module named win32api
You need to install pywin32 because of this Twisted bug.
While deploying a Django + React project on Heroku, this error occurred:
The conflict is caused by:
djoser 2.1.0 depends on social-auth-app-django<5.0.0 and >=4.0.0
rest-social-auth 8.0.0 depends on social-auth-app-django<6.0 and >=5.0
If I downgrade to social-auth-app-django==4.0.0, I get this error instead:
raise ImproperlyConfigured(
django.core.exceptions.ImproperlyConfigured: WSGI application 'backend.wsgi.application' could not be loaded; Error importing module.
This error is caused by social_django, which is added in settings.py:
MIDDLEWARE = [
    ....
    # For social auth
    'social_django.middleware.SocialAuthExceptionMiddleware',
    ....
]
I fixed this error by removing/commenting it out, then hit another one:
cannot import name 'urlquote' from 'django.utils.http' (lib\site-packages\django\utils\http.py)
This happens because urlquote() is no longer available in Django 4.0+, and after the downgrade to social-auth-app-django==4.0.0 the file lib\site-packages\social_django\context_processors.py still tries to run from django.utils.http import urlquote.
I'm in dependency hell. I have even tried downgrading the djoser package, but then I got other errors.
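To see why pip cannot satisfy both packages, the two version ranges from the error message can be compared by hand. The sketch below is only an illustration: parse_version and ranges_overlap are hypothetical helpers written for this post, and they handle plain dotted numbers only (a real resolver, or the packaging library, handles far more cases).

```python
def parse_version(text, width=3):
    """Turn '5.0' into (5, 0, 0) so versions of different lengths compare correctly."""
    parts = [int(part) for part in text.split(".")]
    return tuple(parts + [0] * (width - len(parts)))


def ranges_overlap(first, second):
    """Each range is (inclusive minimum, exclusive maximum), given as version strings."""
    lowest_allowed = max(parse_version(first[0]), parse_version(second[0]))
    highest_allowed = min(parse_version(first[1]), parse_version(second[1]))
    # A version satisfying both ranges exists only if the window is non-empty.
    return lowest_allowed < highest_allowed


# Ranges copied from the pip error message for social-auth-app-django.
djoser_needs = ("4.0.0", "5.0.0")          # >=4.0.0 and <5.0.0
rest_social_auth_needs = ("5.0", "6.0")    # >=5.0 and <6.0

print(ranges_overlap(djoser_needs, rest_social_auth_needs))
```

This prints False: no single version of social-auth-app-django is both below 5.0.0 and at least 5.0, so one of the two packages has to move to a release with a different requirement.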
After a lot of searching, I found a blog post that suggests the following:
First, run pip install pip-tools, then create a requirements.in file containing:
django
djangorestframework
Then run pip-compile ./requirements.in, which generates this requirements.txt file:
# This file is autogenerated by pip-compile with Python 3.9
# by the following command:
#
#    pip-compile ./requirements.in
#
asgiref==3.6.0
    # via django
django==4.1.5
    # via
    #   -r ./requirements.in
    #   djangorestframework
djangorestframework==3.14.0
    # via -r ./requirements.in
pytz==2022.7.1
    # via djangorestframework
sqlparse==0.4.3
    # via django
tzdata==2022.7
    # via django
But this file doesn't contain the other packages I need, such as:
django-cors-headers
djoser
PyJWT
rest-social-auth
social-auth-app-django
etc.
Please help me with this problem; any resource that could help would be appreciated.
I want to unit test a bunch of code that looks like this:
# First command. This will open reddit in your browser.
# -------------------------------------------------------------
if 'open reddit' in command:
    url = 'https://www.reddit.com/'
    if not runtest:
        webbrowser.open(url)
        print('Done!')
        talktome.talkToMe('reddit is opening.')
    if runtest:
        return url
# -------------------------------------------------------------
# ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
I've written this unit test:
import unittest
import subprocess
from GreyMatter import julibrain
from SpeakAndHear import talktome
class TestBrain(unittest.TestCase):
    def test_open_reddit(self):
        test = True
        testurl = julibrain.assistant('open reddit', 1, 2, test)
        #subprocess.call(['pip', 'list', '|', 'grep', 'webbrowser'])
        self.assertEqual(testurl, 'https://www.reddit.com/')
I invoke it on the command line:
python -m unittest test_julibrain.py
It gives me this output:
python -m unittest test_julibrain.py
Package Version
----------------- ----------
autopep8 1.5.2
beautifulsoup4 4.9.0
certifi 2020.4.5.1
chardet 3.0.4
click 7.1.1
future 0.18.2
gTTS 2.1.1
gTTS-token 1.1.3
idna 2.9
isort 4.3.21
lazy-object-proxy 1.4.3
mccabe 0.6.1
mock 4.0.1
MouseInfo 0.1.3
mypy 0.770
mypy-extensions 0.4.3
numpy 1.18.1
Pillow 7.1.1
pip 20.0.2
psutil 5.7.0
PyAudio 0.2.11
PyAutoGUI 0.9.50
pycodestyle 2.5.0
PyGetWindow 0.0.8
pylint 2.4.4
PyMsgBox 1.0.7
pyperclip 1.8.0
PyQt5 5.14.2
PyQt5-sip 12.7.2
PyRect 0.1.4
PyScreeze 0.1.26
python3-xlib 0.15
PyTweening 1.0.3
requests 2.23.0
setuptools 45.2.0
six 1.14.0
soupsieve 2.0
subprocess.run 0.0.8
typed-ast 1.4.1
typing-extensions 3.7.4.1
urllib3 1.25.9
vosk 0.3.3
wheel 0.34.2
wikipedia 1.4.0
wrapt 1.11.2
.
----------------------------------------------------------------------
Ran 1 test in 0.285s
OK
So I can test that the URL is being set, but I don't know how to check that the webbrowser module is working. If I run the code normally, the web browser opens, and I don't want that to happen during the test. I'd like to get back some sort of status that says the module has been imported, a webbrowser object can be instantiated, and there are no problems. After looking at the webbrowser docs, I have no idea how to do this. Any ideas? Thank you. I'm on Ubuntu 18.04, using conda and Python 3.6.1. Cheers!
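One common way to test this kind of code without opening a real browser is to replace webbrowser.open with a mock and assert that it was called with the expected URL. The sketch below is an assumption-laden illustration: open_reddit is a hypothetical stand-in for the 'open reddit' branch of julibrain.assistant, not the real function, and only unittest.mock.patch from the standard library is used.

```python
import unittest
import webbrowser
from unittest import mock


def open_reddit():
    """Hypothetical stand-in for the 'open reddit' branch of the assistant."""
    url = 'https://www.reddit.com/'
    webbrowser.open(url)
    return url


class TestOpenReddit(unittest.TestCase):
    def test_open_reddit_calls_webbrowser(self):
        # While the patch is active, webbrowser.open is a Mock,
        # so no real browser window is opened.
        with mock.patch('webbrowser.open') as fake_open:
            url = open_reddit()
        fake_open.assert_called_once_with('https://www.reddit.com/')
        self.assertEqual(url, 'https://www.reddit.com/')
```

With this approach the runtest flag would no longer be needed in the production code: the test exercises the same path that a normal run takes, and only the call that actually launches the browser is swapped out. It can be run with python -m unittest like the test above.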
I am getting this import error from the scipy package. Please see requirements.txt for versions. This only happens under Flask, not when I run the same code in IPython. Is there any reason why statsmodels would work in IPython but not in Flask?
  File "/env/lib/python3.7/site-packages/statsmodels/api.py", line 9, in <module>
    from . import regression
  File "/env/lib/python3.7/site-packages/statsmodels/regression/__init__.py", line 1, in <module>
    from .linear_model import yule_walker
  File "/env/lib/python3.7/site-packages/statsmodels/regression/linear_model.py", line 39, in <module>
    from scipy.linalg import toeplitz
  File "/env/lib/python3.7/site-packages/scipy/__init__.py", line 156, in <module>
    from . import fft
  File "/env/lib/python3.7/site-packages/scipy/fft/__init__.py", line 81, in <module>
    from ._helper import next_fast_len
  File "/env/lib/python3.7/site-packages/scipy/fft/_helper.py", line 4, in <module>
    from . import _pocketfft
  File "/env/lib/python3.7/site-packages/scipy/fft/_pocketfft/__init__.py", line 3, in <module>
    from .basic import *
  File "/env/lib/python3.7/site-packages/scipy/fft/_pocketfft/basic.py", line 8, in <module>
    from . import pypocketfft as pfft
ImportError: SystemExit: 1
This happens only in Flask, when I use statsmodels.
I'm running it in Flask, and here is my complete list of packages:
attrs==19.3.0
autopep8==1.5
backcall==0.1.0
bleach==3.1.0
cachetools==4.0.0
cattrs==1.0.0
certifi==2019.11.28
chardet==3.0.4
Click==7.0
cloudstorage==0.10.0
cycler==0.10.0
decorator==4.4.1
defusedxml==0.6.0
entrypoints==0.3
Flask==1.1.1
Flask-Assets==2.0
Flask-CacheBuster==1.0.0
Flask-Cors==3.0.8
Flask-SQLAlchemy==2.4.1
flask-talisman==0.7.0
google-api-core==1.16.0
google-auth==1.11.0
google-cloud==0.34.0
google-cloud-core==1.2.0
google-cloud-storage==1.25.0
google-cloud-tasks==1.3.0
google-resumable-media==0.5.0
googleapis-common-protos==1.51.0
grpc-google-iam-v1==0.12.3
grpcio==1.26.0
idna==2.8
importlib-metadata==1.5.0
inflect==3.0.2
inflection==0.3.1
ipykernel==5.1.4
ipython==7.11.1
ipython-genutils==0.2.0
ipywidgets==7.5.1
itsdangerous==1.1.0
jedi==0.16.0
Jinja2==2.11.0
jsonschema==3.2.0
jupyter==1.0.0
jupyter-client==5.3.4
jupyter-console==6.1.0
jupyter-contrib-core==0.3.3
jupyter-contrib-nbextensions==0.5.1
jupyter-core==4.6.1
jupyter-highlight-selected-word==0.2.0
jupyter-latex-envs==1.4.6
jupyter-nbextensions-configurator==0.4.1
kiwisolver==1.1.0
looker-sdk==0.1.3b6
lxml==4.5.0
MarkupSafe==1.1.1
matplotlib==3.1.2
mistune==0.8.4
natural==0.2.0
nbconvert==5.6.1
nbformat==5.0.4
nltk==3.4.5
notebook==6.0.3
numpy==1.18.1
pandas==0.25.3
pandas-datareader==0.8.1
pandocfilters==1.4.2
parso==0.6.0
patsy==0.5.1
pexpect==4.8.0
pg8000==1.13.2
pickleshare==0.7.5
prometheus-client==0.7.1
prompt-toolkit==3.0.3
protobuf==3.11.2
psycopg2==2.8.4
ptyprocess==0.6.0
pyasn1==0.4.8
pyasn1-modules==0.2.8
pycodestyle==2.5.0
Pygments==2.5.2
pyparsing==2.4.6
pyrsistent==0.15.7
python-dateutil==2.8.1
python-magic==0.4.15
pytz==2019.3
PyYAML==5.3
pyzmq==18.1.1
qtconsole==4.6.0
requests==2.22.0
rsa==4.0
scipy==1.4.1
scramp==1.1.0
seasonal==0.3.1
Send2Trash==1.5.0
simple-salesforce==0.74.3
simplenlg==0.2.0
six==1.14.0
SQLAlchemy==1.3.13
statsmodels==0.11.0
stripe==2.42.0
terminado==0.8.3
testpath==0.4.4
tornado==6.0.3
traitlets==4.3.3
tzlocal==2.0.0
urllib3==1.25.8
wcwidth==0.1.8
webapp2==2.5.2
webassets==2.0
webencodings==0.5.1
Werkzeug==0.16.1
widgetsnbextension==3.5.1
zipp==2.1.0
There is no error in IPython, but I get the error when I run it in Flask.
Check out this issue for more details.
It turns out that statsmodels depends on several packages being installed before it, so that it can build its own modules against them. I don't know exactly what causes this particular conflict with Flask, but I have faced it on Python 3.7 before, and here is what worked for me:
If you need to clean out what you already have, uninstall statsmodels with the following command:
pip3 uninstall statsmodels
Now install Flask:
pip3 install Flask
Install the individual dependencies of statsmodels first:
pip3 install numpy scipy patsy pandas
Now install statsmodels:
pip3 install statsmodels
Important!
Make sure you are running in a fresh virtual environment. To create and activate a new one on a Linux system:
python3 -m venv venv
source venv/bin/activate
Now install the dependencies as instructed above. Let me know if that fixes your problem. If not, you might need a better dependency resolver than pip; Poetry is one option.
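When code works in IPython but not in Flask, it can also help to confirm that both are running the same interpreter and can see the same packages. The following is a rough diagnostic sketch, not a fix: find_missing and environment_report are hypothetical helper names, and the check only covers top-level module names.

```python
import importlib.util
import sys


def find_missing(module_names):
    """Return the top-level module names that cannot be located for import."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


def environment_report(module_names):
    """Summarize which interpreter is running and which modules it cannot find."""
    return {
        "interpreter": sys.executable,
        "missing": find_missing(module_names),
    }


if __name__ == "__main__":
    # The dependencies named in the install steps above.
    print(environment_report(["numpy", "scipy", "patsy", "pandas", "statsmodels"]))
```

Run it once from IPython and once from inside the Flask process (for example, by logging the result at startup). If the interpreter paths differ, the two are using different environments, which would explain why only one of them fails.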
Just restart the server and the problem is solved!
I've come across the following error about html5lib when trying to read an HTML table into a data frame.
Here is the code:
!pip install html5lib
!pip install lxml
!pip install beautifulsoup4
import html5lib
import lxml
import pandas as pd
from bs4 import BeautifulSoup
table_list = pd.read_html("http://www.psmsl.org/data/obtaining/")
This is the error:
ImportError Traceback (most recent call last)
<ipython-input-68-e24654a0a301> in <module>()
----> 1 table_list = pd.read_html("http://www.psmsl.org/data/obtaining/")
/home/sage/sage-8.0/local/lib/python2.7/site-packages/pandas/io/html.pyc in read_html(io, match, flavor, header, index_col, skiprows, attrs, parse_dates, tupleize_cols, thousands, encoding, decimal, converters, na_values, keep_default_na)
913 thousands=thousands, attrs=attrs, encoding=encoding,
914 decimal=decimal, converters=converters, na_values=na_values,
--> 915 keep_default_na=keep_default_na)
/home/sage/sage-8.0/local/lib/python2.7/site-packages/pandas/io/html.pyc in _parse(flavor, io, match, attrs, encoding, **kwargs)
737 retained = None
738 for flav in flavor:
--> 739 parser = _parser_dispatch(flav)
740 p = parser(io, compiled_match, attrs, encoding)
741
/home/sage/sage-8.0/local/lib/python2.7/site-packages/pandas/io/html.pyc in _parser_dispatch(flavor)
680 if flavor in ('bs4', 'html5lib'):
681 if not _HAS_HTML5LIB:
--> 682 raise ImportError("html5lib not found, please install it")
683 if not _HAS_BS4:
684 raise ImportError(
ImportError: html5lib not found, please install it
Any help would be much appreciated.
Thanks
As the error message says, html5lib is not installed. Run:
pip install html5lib
in your terminal.
If you are installing from a Jupyter notebook (as you did with !), try restarting the kernel so that the newly installed packages are loaded.
I had this exact error show up while trying to read a saved .htm file in the Spyder IDE.
This code raised the html5lib error:
import pandas as pd
df = pd.read_html("F:\xxxx\xxxxx\xxxxx\aaaa.htm")
I knew html5lib was installed and working correctly, because other scripts of mine that use it worked.
The file path needed to be a raw string literal (an r in front of the opening quote), so that the backslashes in the Windows path are not interpreted as escape sequences.
This code works for me:
import pandas as pd
df = pd.read_html(r"F:\xxxx\xxxxx\xxxxx\aaaa.htm")
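To illustrate the difference, here is a short sketch with a made-up Windows path (the path and the helper has_control_characters are hypothetical, chosen only to show the effect). In a normal string literal, \t and \n are turned into a tab and a newline, so the path no longer points at the file; in a raw string the backslashes are kept as written.

```python
# Hypothetical Windows path, used only for illustration.
plain = "C:\temp\new_data.htm"   # \t and \n are converted to a tab and a newline
raw = r"C:\temp\new_data.htm"    # backslashes are kept literally


def has_control_characters(path):
    """Return True if the path contains a tab or a newline character."""
    return any(ch in path for ch in ("\t", "\n"))


print(repr(plain))
print(repr(raw))
print(has_control_characters(plain), has_control_characters(raw))
```

Forward slashes or doubled backslashes (\\) in the path are other ways to avoid the same problem.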
I ran into this error when I gave the wrong path to the local file I was trying to open. So also be sure that you're pointing to the right place!
I'm trying to update django-cms from version 2.2 to 2.3, and when I try to run "runserver", I get this error:
django.template.base.TemplateSyntaxError: 'cms_tags' is not a valid tag library: ImportError raised loading cms.templatetags.cms_tags: cannot import name VersionAdapter
The installation of django-cms==2.3 worked perfectly, but something somewhere is breaking everything.
Here's a list of what's installed on this virtualenv:
BeautifulSoup==3.2.0
Django==1.3.1
Fabric==1.4.3
MySQL-python==1.2.3
PIL==1.1.7
South==0.7.3
argparse==1.2.1
cmsplugin-filer==0.8.0
django-admin-tools==0.4.0
-e git://github.com/bmihelac/django-app-name-translation-in-admin.git#c7b033bdfb54fee5217248f4e596aa23752185e0#egg=django_app_name_translation_in_admin-dev
django-appconf==0.4.1
django-classy-tags==0.3.4.1
django-cms==2.3
django-debug-toolbar==0.8.5
django-filer==0.8.5
-e git+https://github.com/toastdriven/django-haystack.git#d2b51143c8008be8523a78822002de1836b9f501#egg=django_haystack-dev
django-hvad==0.1.5
django-mptt==0.5.1
django-page-cms==1.4.5
django-permissions==1.0.3
django-reversion==1.4
django-sekizai==0.5
django-staticfiles==1.1.2
django-transmeta==0.6.2
easy-thumbnails==1.0-alpha-18
feedparser==5.0.1
html5lib==0.90
httplib2==0.7.2
ipython==0.11
poster==0.8.1
pycrypto==2.6
pysolr==2.1.0-beta
python-ptrace==0.6.4
simplejson==2.2.1
sorl-thumbnail==11.09
ssh==1.7.14
virtualenv==1.6.4
wsgiref==0.1.2
Does anyone have an idea?
VersionAdapter belongs to django-reversion, so you need to upgrade django-reversion to version 1.6. See the django-cms installation notes:
http://docs.django-cms.org/en/2.3.3/getting_started/installation.html