Scrapy download files from FTP - python-2.7

I need to download a group of csv files using scrapy from an FTP server. But first I need to scrape a website (https://www.douglas.co.us/assessor/data-downloads/) in order to get the urls of the csv files on the FTP server. I read about how to download files in the documentation (Downloading and processing files and images).
settings
custom_settings = {
    'ITEM_PIPELINES': {
        'scrapy.pipelines.files.FilesPipeline': 1,
    },
    'FILES_STORE': os.path.dirname(os.path.abspath(__file__))
}
parse
def parse(self, response):
    self.logger.info("In parse method!!!")
    # Property Ownership
    property_ownership = response.xpath("//a[contains(., 'Property Ownership')]/@href").extract_first()
    # Property Location
    property_location = response.xpath("//a[contains(., 'Property Location')]/@href").extract_first()
    # Property Improvements
    property_improvements = response.xpath("//a[contains(., 'Property Improvements')]/@href").extract_first()
    # Property Value
    property_value = response.xpath("//a[contains(., 'Property Value')]/@href").extract_first()
    item = FiledownloadItem()
    self.insert_keyvalue(item, "file_urls",
                         [property_ownership, property_location, property_improvements, property_value])
    yield item
But I got the following error
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/pipelines/media.py", line 79, in process_item
    requests = arg_to_iter(self.get_media_requests(item, info))
  File "/usr/local/lib/python2.7/dist-packages/scrapy/pipelines/files.py", line 382, in get_media_requests
    return [Request(x) for x in item.get(self.files_urls_field, [])]
  File "/usr/local/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 25, in __init__
    self._set_url(url)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 58, in _set_url
    raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: [
The best explanation of my problem is the answer to the question scrapy error: exceptions.ValueError: Missing scheme in request url:, which explains that the problem is that the urls to download are missing the "http://" scheme.
What should I do in my case? Can I use FilesPipeline, or do I need to do something different?
Thanks in advance.

ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: [
According to the traceback, scrapy thinks your file url is '['.
My best guess is that you have an error in the insert_keyvalue() method.
Also, why have a method for this? A simple assignment should work.
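For example, here is a minimal sketch of parse() with a plain assignment (assuming the extracted hrefs are already absolute urls carrying their scheme, e.g. ftp:// or http://, and that FiledownloadItem declares a file_urls field; if the hrefs are relative you would need response.urljoin() first):

def parse(self, response):
    self.logger.info("In parse method!!!")
    labels = ["Property Ownership", "Property Location",
              "Property Improvements", "Property Value"]
    urls = [response.xpath("//a[contains(., '%s')]/@href" % label).extract_first()
            for label in labels]
    item = FiledownloadItem()
    # FilesPipeline reads a list of absolute urls from the file_urls field.
    item["file_urls"] = [u for u in urls if u]
    yield item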

Related

'WebSocketProtocol' object has no attribute 'application_queue' error Django Channels

In my Django project I want to create a chat app using Channels. But when I followed this tutorial:
https://channels.readthedocs.io/en/stable/tutorial/part_2.html , I had a problem: the websocket disconnects automatically right after connecting:
Exception in callback AsyncioSelectorReactor.callLater.<locals>.run() at C:\Users\Administrator\AppData\Local\Programs\Python\Python36-32\lib\site-packages\twisted\internet\asyncioreactor.py:287
handle: <TimerHandle when=19405.993 AsyncioSelectorReactor.callLater.<locals>.run() at C:\Users\Administrator\AppData\Local\Programs\Python\Python36-32\lib\site-packages\twisted\internet\asyncioreactor.py:287>
Traceback (most recent call last):
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36-32\lib\asyncio\events.py", line 145, in _run
    self._callback(*self._args)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36-32\lib\site-packages\twisted\internet\asyncioreactor.py", line 290, in run
    f(*args, **kwargs)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36-32\lib\site-packages\twisted\internet\tcp.py", line 289, in connectionLost
    protocol.connectionLost(reason)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36-32\lib\site-packages\autobahn\twisted\websocket.py", line 128, in connectionLost
    self._connectionLost(reason)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36-32\lib\site-packages\autobahn\websocket\protocol.py", line 2467, in _connectionLost
    WebSocketProtocol._connectionLost(self, reason)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36-32\lib\site-packages\autobahn\websocket\protocol.py", line 1096, in _connectionLost
    self._onClose(self.wasClean, WebSocketProtocol.CLOSE_STATUS_CODE_ABNORMAL_CLOSE, "connection was closed uncleanly (%s)" % self.wasNotCleanReason)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36-32\lib\site-packages\autobahn\twisted\websocket.py", line 171, in _onClose
    self.onClose(wasClean, code, reason)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36-32\lib\site-packages\daphne\ws_protocol.py", line 146, in onClose
    self.application_queue.put_nowait({
AttributeError: 'WebSocketProtocol' object has no attribute 'application_queue'
Above this error there should be a list of log lines, and at the top of those lines is the real error. In my case it was the routing url.
Running the migrations solved it for me:
python manage.py migrate
Just to clarify this error for everyone who faces it while following the Channels tutorial: for me the error was in my path.
2018-05-17 15:55:36,382 - ERROR - ws_protocol - [Failure instance: Traceback: : No route found for path 'ws/chat/me/'.
You can find it at the top of the stack trace.
I had defined an invalid URL pattern, "buzz", because I like to change things up. Essentially Channels routing works the same way as Django url patterns, so you just need to make sure that your routing structure matches the path you actually connect to.
websocket_urlpatterns = [
    url(r'^ws/buzz/(?P<room_name>[^/]+)/$', consumers.ChatConsumer),
]
In my case the route path did not match my websocket url patterns, so I had to update the pattern to be correct by changing my websocket_urlpatterns back to chat:
websocket_urlpatterns = [
    url(r'^ws/chat/(?P<room_name>[^/]+)/$', consumers.ChatConsumer),
]
so that the system could resolve the route. Hopefully this helps people who run into this in the future. Good luck. P.T.
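For reference, here is a minimal sketch of the project-level routing that wires those websocket_urlpatterns in, based on the Channels 2 tutorial (the mysite/routing.py module name and the chat app name are assumptions; adjust them to your project layout). The path the browser opens, e.g. /ws/chat/<room_name>/, has to match one of these patterns, otherwise daphne logs "No route found for path":

# mysite/routing.py (sketch; module names assumed from the Channels tutorial)
from channels.auth import AuthMiddlewareStack
from channels.routing import ProtocolTypeRouter, URLRouter
import chat.routing

application = ProtocolTypeRouter({
    # http -> Django views is handled by default
    'websocket': AuthMiddlewareStack(
        URLRouter(
            chat.routing.websocket_urlpatterns
        )
    ),
})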

Error while accessing Google Contacts Groups by API

I want to get a list of all the groups in my contacts. I use this code:
import gdata.gauth
import gdata.contacts.client

token = gdata.gauth.OAuth2Token(client_id="***.apps.googleusercontent.com",
                                client_secret="***",
                                scope="https://www.google.com/m8/feeds/",
                                user_agent="GC")
gd_client = gdata.contacts.client.ContactsClient(source='GCv0.1')
gd_client = token.authorize(gd_client)
gd_client.GetGroups()
But I got this error:
Traceback (most recent call last):
  File "F:/Yandex/Sites/GoogleContacts/cli_contacts.py", line 27, in <module>
    gd_client.GetGroups()
  File "C:\Users\Ishayahu\27Gdata\lib\site-packages\gdata\contacts\client.py", line 218, in get_groups
    return self.get_feed(uri, desired_class=desired_class, auth_token=auth_token, **kwargs)
  File "C:\Users\Ishayahu\27Gdata\lib\site-packages\gdata\client.py", line 640, in get_feed
    **kwargs)
  File "C:\Users\Ishayahu\27Gdata\lib\site-packages\gdata\client.py", line 319, in request
    RequestError)
gdata.client.RequestError: Server responded with: 400,
I have no idea what the reason is and I can't find any clue how to solve it.
UPD: It looks like I should somehow put an access_token or refresh_token into OAuth2Token, but I can't understand how, so it sends these headers:
{'GData-Version': '3', 'Authorization': 'Bearer None', 'User-Agent': 'gdata-py/2.0.17'}
UPD2: By the way, if I test it in the OAuth2 Playground, it shows me a page requesting access to my contacts. This script doesn't ask for that. Maybe that's the problem? How can I change it? I thought it was connected with url_redirect, but I can't understand how to use it.
UPD3: I was right: if I add an access_token, which I got manually from the Playground, everything works. But how should I get it in the script?!
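As a sketch of how the script itself can obtain the access token, this is my understanding of the installed-app flow with gdata.gauth.OAuth2Token (please verify the generate_authorize_url / get_access_token method names against your gdata-python-client version; the *** values are placeholders):

import gdata.gauth
import gdata.contacts.client

token = gdata.gauth.OAuth2Token(client_id="***.apps.googleusercontent.com",
                                client_secret="***",
                                scope="https://www.google.com/m8/feeds/",
                                user_agent="GC")

# 1. Open this url in a browser; it shows the consent page the Playground showed you.
print token.generate_authorize_url(redirect_uri="urn:ietf:wg:oauth:2.0:oob")

# 2. Paste the code displayed after granting access; this fills in token.access_token.
code = raw_input("Authorization code: ")
token.get_access_token(code)

# 3. The client now sends "Authorization: Bearer <real token>" instead of "Bearer None".
gd_client = gdata.contacts.client.ContactsClient(source='GCv0.1')
gd_client = token.authorize(gd_client)
print gd_client.GetGroups()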

Exception when trying to create bigquery table via python API

I'm working on an app that will stream events into BQ. Since streamed inserts require the table to pre-exist, I'm running the following code to check whether the table exists and then to create it if it doesn't:
TABLE_ID = "data" + single_date.strftime("%Y%m%d")
exists = False
request = bigquery.tables().list(projectId=PROJECT_ID,
                                 datasetId=DATASET_ID)
response = request.execute()

while response is not None:
    for t in response.get('tables', []):
        if t['tableReference']['tableId'] == TABLE_ID:
            exists = True
            break
    request = bigquery.tables().list_next(request, response)
    if request is None:
        break

if not exists:
    print("Creating Table " + TABLE_ID)
    dataset_ref = {'datasetId': DATASET_ID,
                   'projectId': PROJECT_ID}
    table_ref = {'tableId': TABLE_ID,
                 'datasetId': DATASET_ID,
                 'projectId': PROJECT_ID}
    schema_ref = SCHEMA
    table = {'tableReference': table_ref,
             'schema': schema_ref}
    table = bigquery.tables().insert(body=table, **dataset_ref).execute(http)
I'm running Python 2.7, and have installed the Google client API through pip.
When I try to run the script, I get the following error:
No handlers could be found for logger "oauth2client.util"
Traceback (most recent call last):
  File "do_hourly.py", line 158, in <module>
    main()
  File "do_hourly.py", line 101, in main
    body=table, **dataset_ref).execute(http)
  File "build/bdist.linux-x86_64/egg/oauth2client/util.py", line 142, in positional_wrapper
  File "/usr/lib/python2.7/site-packages/googleapiclient/http.py", line 721, in execute
    resp, content = http.request(str(self.uri), method=str(self.method),
AttributeError: 'module' object has no attribute 'request'
I tried researching the issue, but all I could find was info about confusion between urllib, urllib2 and Python 2.7 / 3.
I'm not quite sure how to continue with this, and will appreciate all help.
Thanks!
Figured out that the issue was in the following line, which I took from another SO thread:
table = bigquery.tables().insert(body=table, **dataset_ref).execute(http)
Once I removed the "http" variable, which doesn't exist in my scope, the exception disappeared.
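For reference, a minimal sketch of the corrected call (assuming the bigquery service object was built with authorized credentials via googleapiclient.discovery.build, so execute() needs no http argument):

table = bigquery.tables().insert(
    projectId=PROJECT_ID,
    datasetId=DATASET_ID,
    body=table
).execute()  # uses the http the service object was built with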

CRITICAL: Unhandled error in Deferred:

I am developing a spider project and I have moved to a new computer. Now I am installing everything and I have encountered a problem with Twisted. I have read about this bug and I have installed pywin32 and then also WinPython, but it doesn't help. I have tried to update Twisted with this command
pip install Twisted --update
as advised in the forum, but it says that pip install doesn't have an --update option. I have also run
python python27\scripts\pywin32_postinstall.py -install
but with no success. This is my error:
G:\Job_vacancies\Python\vacancies>scrapy crawl jobs
2015-10-06 09:12:53 [scrapy] INFO: Scrapy 1.0.3 started (bot: vacancies)
2015-10-06 09:12:53 [scrapy] INFO: Optional features available: ssl, http11
2015-10-06 09:12:53 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'vacancies.spiders', 'SPIDER_MODULES': ['vacancies.spiders'], 'DEPTH_LIMIT': 3, 'BOT_NAME': 'vacancies'}
2015-10-06 09:12:53 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
Unhandled error in Deferred:
2015-10-06 09:12:53 [twisted] CRITICAL: Unhandled error in Deferred:
Traceback (most recent call last):
  File "c:\python27\lib\site-packages\scrapy\cmdline.py", line 150, in _run_command
    cmd.run(args, opts)
  File "c:\python27\lib\site-packages\scrapy\commands\crawl.py", line 57, in run
    self.crawler_process.crawl(spname, **opts.spargs)
  File "c:\python27\lib\site-packages\scrapy\crawler.py", line 153, in crawl
    d = crawler.crawl(*args, **kwargs)
  File "c:\python27\lib\site-packages\twisted\internet\defer.py", line 1274, in unwindGenerator
    return _inlineCallbacks(None, gen, Deferred())
--- <exception caught here> ---
  File "c:\python27\lib\site-packages\twisted\internet\defer.py", line 1128, in _inlineCallbacks
    result = g.send(result)
  File "c:\python27\lib\site-packages\scrapy\crawler.py", line 71, in crawl
    self.engine = self._create_engine()
  File "c:\python27\lib\site-packages\scrapy\crawler.py", line 83, in _create_engine
    return ExecutionEngine(self, lambda _: self.stop())
  File "c:\python27\lib\site-packages\scrapy\core\engine.py", line 66, in __init__
    self.downloader = downloader_cls(crawler)
  File "c:\python27\lib\site-packages\scrapy\core\downloader\__init__.py", line 65, in __init__
    self.handlers = DownloadHandlers(crawler)
  File "c:\python27\lib\site-packages\scrapy\core\downloader\handlers\__init__.py", line 23, in __init__
    cls = load_object(clspath)
  File "c:\python27\lib\site-packages\scrapy\utils\misc.py", line 44, in load_object
    mod = import_module(module)
  File "c:\python27\lib\importlib\__init__.py", line 37, in import_module
    __import__(name)
  File "c:\python27\lib\site-packages\scrapy\core\downloader\handlers\s3.py", line 6, in <module>
    from .http import HTTPDownloadHandler
  File "c:\python27\lib\site-packages\scrapy\core\downloader\handlers\http.py", line 5, in <module>
    from .http11 import HTTP11DownloadHandler as HTTPDownloadHandler
  File "c:\python27\lib\site-packages\scrapy\core\downloader\handlers\http11.py", line 15, in <module>
    from scrapy.xlib.tx import Agent, ProxyAgent, ResponseDone, \
  File "c:\python27\lib\site-packages\scrapy\xlib\tx\__init__.py", line 3, in <module>
    from twisted.web import client
  File "c:\python27\lib\site-packages\twisted\web\client.py", line 42, in <module>
    from twisted.internet.endpoints import TCP4ClientEndpoint, SSL4ClientEndpoint
  File "c:\python27\lib\site-packages\twisted\internet\endpoints.py", line 34, in <module>
    from twisted.internet.stdio import StandardIO, PipeAddress
  File "c:\python27\lib\site-packages\twisted\internet\stdio.py", line 30, in <module>
    from twisted.internet import _win32stdio
  File "c:\python27\lib\site-packages\twisted\internet\_win32stdio.py", line 7, in <module>
    import win32api
exceptions.ImportError: DLL load failed: The specified module could not be found.
2015-10-06 09:12:53 [twisted] CRITICAL:
And this is my code:
#!/usr/bin/python
# -*- coding: utf-8 -*-
# encoding=UTF-8
import scrapy, urlparse
from scrapy.http import Request
from scrapy.utils.response import get_base_url
from urlparse import urlparse, urljoin
from vacancies.items import JobItem
#We need that in order to force Slovenian pages instead of English pages. It happened at "http://www.g-gmi.si/gmiweb/" that only English pages were found and no Slovenian.
#from scrapy.conf import settings
#settings.overrides['DEFAULT_REQUEST_HEADERS'] = {'Accept':'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8','Accept-Language':'sl',}
#settings.overrides['DEFAULT_REQUEST_HEADERS'] = {'Accept':'text/html,application/xhtml+xml;q=0.9,*/*;q=0.8','Accept-Language':'sl','en':q=0.8,}
class JobSpider(scrapy.Spider):
    name = "jobs"

    #Test sample of SLO companies
    start_urls = [
        "http://www.g-gmi.si/gmiweb/",
    ]

    #Result of the programme is this list of job vacancies webpages.
    jobs_urls = []

    def parse(self, response):
        response.selector.remove_namespaces()

        #We take all urls, they are marked by "href". These are either webpages on our website or new websites.
        urls = response.xpath('//@href').extract()

        #Base url.
        base_url = get_base_url(response)

        #Loop through all urls on the webpage.
        for url in urls:

            #If the url represents a picture, a document, a compression ... we ignore it. We might have to change that because some companies provide job vacancies information in PDF.
            if url.endswith((
                #images
                '.jpg', '.jpeg', '.png', '.gif', '.eps', '.ico', '.svg', '.tif', '.tiff',
                '.JPG', '.JPEG', '.PNG', '.GIF', '.EPS', '.ICO', '.SVG', '.TIF', '.TIFF',
                #documents
                '.xls', '.ppt', '.doc', '.xlsx', '.pptx', '.docx', '.txt', '.csv', '.pdf', '.pd',
                '.XLS', '.PPT', '.DOC', '.XLSX', '.PPTX', '.DOCX', '.TXT', '.CSV', '.PDF', '.PD',
                #music and video
                '.mp3', '.mp4', '.mpg', '.ai', '.avi', '.swf',
                '.MP3', '.MP4', '.MPG', '.AI', '.AVI', '.SWF',
                #compressions and other
                '.zip', '.rar', '.css', '.flv', '.php',
                '.ZIP', '.RAR', '.CSS', '.FLV', '.PHP',
            )):
                continue

            #If the url includes characters like ?, %, &, # ... it is LIKELY NOT the one we are looking for and we ignore it.
            #However in this case we exclude good urls like http://www.mdm.si/company#employment
            if any(x in url for x in ['?', '%', '&', '#']):
                continue

            #Ignore ftp.
            if url.startswith("ftp"):
                continue

            #We need to save the original url for xpath, in case we change it later (join it with base_url).
            url_xpath = url

            #If the url doesn't start with "http", it is a relative url, and we add the base url to get an absolute url.
            # -- It is true that we may get some strange urls, but it is fine for now.
            if not (url.startswith("http")):
                url = urljoin(base_url, url)

            #We don't want to go to other websites. We want to stay on our website, so we keep only urls with the domain (netloc) of the company we are investigating.
            if (urlparse(url).netloc == urlparse(base_url).netloc):

                #The main part. We look for webpages whose urls include one of the employment words as strings.
                # -- Instruction.
                # -- Users in other languages, please insert employment words in your own language, like jobs, vacancies, career, employment ... --
                if any(x in url for x in [
                    'zaposlovanje',
                    'Zaposlovanje',
                    'zaposlitev',
                    'Zaposlitev',
                    'zaposlitve',
                    'Zaposlitve',
                    'zaposlimo',
                    'Zaposlimo',
                    'kariera',
                    'Kariera',
                    'delovna-mesta',
                    'delovna_mesta',
                    'pridruzi-se',
                    'pridruzi_se',
                    'prijava-za-delo',
                    'prijava_za_delo',
                    'oglas',
                    'Oglas',
                    'iscemo',
                    'Iscemo',
                    'careers',
                    'Careers',
                    'jobs',
                    'Jobs',
                    'employment',
                    'Employment',
                ]):
                    #This is an additional filter, suggested by Dan Wu, to improve accuracy. We will check the text of the url as well.
                    texts = response.xpath('//a[@href="%s"]/text()' % url_xpath).extract()

                    #1. Texts are empty.
                    if texts == []:
                        print "Ni teksta za url: " + str(url)

                        #We found a url that includes one of the magic words.
                        #We check whether we have found the url before. If it is new, we add it to the list "jobs_urls".
                        if url not in self.jobs_urls:
                            self.jobs_urls.append(url)
                            item = JobItem()
                            #item["text"] = text
                            item["url"] = url
                            #We return the item.
                            yield item

                    # 2. There are texts, one or more.
                    else:
                        #For the same partial url several texts are possible.
                        for text in texts:
                            if any(x in text for x in [
                                'zaposlovanje',
                                'Zaposlovanje',
                                'zaposlitev',
                                'Zaposlitev',
                                'zaposlitve',
                                'Zaposlitve',
                                'zaposlimo',
                                'Zaposlimo',
                                'ZAPOSLIMO',
                                'kariera',
                                'Kariera',
                                'delovna-mesta',
                                'delovna_mesta',
                                'pridruzi-se',
                                'pridruzi_se',
                                'oglas',
                                'Oglas',
                                'iscemo',
                                'Iscemo',
                                'ISCEMO',
                                'careers',
                                'Careers',
                                'jobs',
                                'Jobs',
                                'employment',
                                'Employment',
                            ]):
                                #We found a url that includes one of the magic words, and the text also includes a magic word.
                                #We check whether we have found the url before. If it is new, we add it to the list "jobs_urls".
                                if url not in self.jobs_urls:
                                    self.jobs_urls.append(url)
                                    item = JobItem()
                                    item["text"] = text
                                    item["url"] = url
                                    #We return the item.
                                    yield item

                #We don't put an "else" branch because we want to further explore the employment webpage to find possible new employment webpages.
                #We keep looking for employment webpages until we reach the DEPTH that we have set in settings.py.
                yield Request(url, callback=self.parse)
# We run the programme in the command line with this command:
# scrapy crawl jobs -o jobs.csv -t csv --logfile log.txt
# We get two output files
# 1) jobs.csv
# 2) log.txt
# Then we manually put one of employment urls from jobs.csv into read.py
I would be glad if you could give some advice on how to run this thing. Thank you, Marko
You should always install stuff into a virtualenv. Once you've got a virtualenv and it's active, do:
pip install --upgrade twisted pypiwin32
and you should get the dependency that makes Twisted support stdio on the Windows platform.
To get all the goodies you might try
pip install --upgrade twisted[windows_platform]
but you may run into problems with gmp.h if you try that, and you don't need most of it to do what you're trying to do.
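Once that is installed, a quick sanity check (just a sketch; run it from the same virtualenv) is to confirm that the pywin32 extension DLLs import cleanly, since import win32api is exactly what fails in the traceback above:

# If this runs without "ImportError: DLL load failed", Twisted's
# _win32stdio module (and therefore Scrapy's download handlers) should load too.
import win32api
print "pywin32 OK:", win32api.GetVersionEx()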

Can't get the facebook-tornado connection

I've a problem. I am a beginner and I am trying to make a simple asynchronous program that posts to Facebook. I use the tornado example and the tornado-facebook-sdk; here is the code:
class MainHandler(BaseHandler, tornado.auth.FacebookGraphMixin):
    @tornado.web.authenticated
    @tornado.web.asynchronous
    def get(self):
        self.facebook_request("/me/home", self.print_callback,
                              access_token=self.current_user["access_token"])
        a = self.current_user["access_token"]
        #print a

    def print_callback(data):
        print data
        ioloop.stop()

graph.get_object('/facebook', callback=print_callback)
and I get this error:
TypeError: print_callback() takes exactly 1 argument (2 given)
I want to understand this example in order to get the token, and then use the example:
def callback(response):
    # ...

graph.put_object('me', 'feed', message="Maoe!!", callback=callback)
to write something on my Facebook wall. I did it with the synchronous library, but sadly that is blocking!
UPDATE: still getting an error:
class MainHandler(BaseHandler, tornado.auth.FacebookGraphMixin):
    @tornado.web.authenticated
    @tornado.web.asynchronous
    def get(self):
        self.facebook_request("/me/home", self.print_callback,
                              access_token=self.current_user["access_token"])
        a = self.current_user["access_token"]
        print a

    def print_callback(self, data):
        graph.post_wall(self, "heloooooooo")
and I got this error:
[E 121009 14:28:47 web:1108] Uncaught exception GET / (::1)
HTTPRequest(.....)
Traceback (most recent call last):
  File "C:\Python27\lib\site-packages\tornado-2.4.post1-py2.7.egg\tornado\web.py", line 1043, in _stack_context_handle_exception
    raise_exc_info((type, value, traceback))
  File "C:\Python27\lib\site-packages\tornado-2.4.post1-py2.7.egg\tornado\stack_context.py", line 237, in _nested
    yield vars
  File "C:\Python27\lib\site-packages\tornado-2.4.post1-py2.7.egg\tornado\stack_context.py", line 210, in wrapped
    callback(*args, **kwargs)
  File "C:\Python27\lib\site-packages\tornado-2.4.post1-py2.7.egg\tornado\gen.py", line 405, in inner
    self.set_result(key, result)
  File "C:\Python27\lib\site-packages\tornado-2.4.post1-py2.7.egg\tornado\gen.py", line 335, in set_result
    self.run()
  File "C:\Python27\lib\site-packages\tornado-2.4.post1-py2.7.egg\tornado\gen.py", line 365, in run
    yielded = self.gen.send(next)
  File "build\bdist.win-amd64\egg\facebook\graphapi.py", line 129, in _make_request
    raise GraphAPIError(data)
GraphAPIError: (#200) This API call requires a valid app_id.
And when I go to Facebook, I see that it's a valid key I'm using. I even take the generated token (the a variable here), paste it into the API Debug tool, and everything works fine:
Valid : True
Origin : Web
Scopes : create_note photo_upload publish_actions publish_stream read_stream share_item status_update video_upload
Add self to print_callback.
def print_callback(self, data):
    print data
    ioloop.stop()

graph.get_object('/facebook', callback=print_callback)
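Since print_callback is now an instance method, the call site should pass the bound method so that only the data argument is supplied by the library (a small sketch, assuming graph is accessible inside the handler):

# e.g. inside get(): self is bound automatically, data comes from the callback.
graph.get_object('/facebook', callback=self.print_callback)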