scrapy shell enable javascript


I am trying to get the response.body of https://www.wickedfire.com/ in scrapy shell,
but the response.body tells me:
<html><title>You are being redirected...</title>\n<noscript>Javascript is required. Please enable javascript before you are allowed to see this page...
How do I enable JavaScript? Or is there something else that I can do?
Thank you in advance.
UPDATE:
I've installed scrapy-splash (pip install scrapy-splash)
and put these settings in settings.py:
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPLASH_URL = 'http://localhost:8050/'
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
It gives me an error:
NameError: Module 'scrapy_splash' doesn't define any object named 'SplashCoockiesMiddleware'
I commented that line out after the error, and the error went away.
My script looks like this, but it doesn't work:
...
from scrapy_splash import SplashRequest
...
    start_urls = ['https://www.wickedfire.com/login.php?do=login']
    payload = {'vb_login_username':'','vb_login_password':''}
    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse,args={'wait':1})
    def parse(self, response):
        # url = "https://www.wickedfire.com/login.php?do=login"
        r = SplashFormRequest(response,formdata=payload,callback=self.after_login)
        return r
    def after_login(self,response):
        print response.body + "THIS IS THE BODY"
        if "incorrect" in response.body:
            self.logger.error("Login failed")
            return
        else:
            results = FormRequest.from_response(response,
                                                formdata={'query': 'bitter'},
                                                callback=self.parse_page)
            return results
...
This is the error that I get:
[scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://wickedfire.com/ via http://localhost:8050/render.html> (failed 1 times): 502 Bad Gateway
[scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET https://wickedfire.com/ via http://localhost:8050/render.html> (failed 2 times): 502 Bad Gateway
[scrapy.downloadermiddlewares.retry] DEBUG: Gave up retrying <GET https://wickedfire.com/ via http://localhost:8050/render.html> (failed 3 times): 502 Bad Gateway
[scrapy.core.engine] DEBUG: Crawled (502) <GET https://wickedfire.com/ via http://localhost:8050/render.html> (referer: None) ['partial']
[scrapy.spidermiddlewares.httperror] INFO: Ignoring response <502 https://wickedfire.com/>: HTTP status code is not handled or not allowed
I also tried scrapy-splash with scrapy shell using this guide.
I just want to log in to the page, enter a keyword to search for, and get the results. That is my end goal.

Related

HTTPSConnectionPool(host='localhost', port=8000): Max retries exceeded with url:/index

I am using Django 2.2.10 and using python manage.py runsslserver to locally develop an https site.
I have written an authentication app with a view function that returns JSON data as follows:
def foobar(request):
    data = {
        'param1': "foo bar"
    }
    return JsonResponse(data)
I am calling this function in the parent project as follows:
def index(request):
    scheme_domain_port = request.build_absolute_uri()[:-1]
    myauth_login_links_url=f"{scheme_domain_port}{reverse('myauth:login_links')}"
    print(myauth_login_links_url)
    data = requests.get(myauth_login_links_url).json()
    print(data)
When I navigate to https://localhost:8000myproj/index, I see that the correct URL is printed in the console, followed by multiple errors, culminating in the error shown in the title of this question:
HTTPSConnectionPool(host='localhost', port=8000): Max retries exceeded with url:/index (Caused by SSLError(SSLError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:852)'),))
How do I pass the SSL cert being used in my session (presumably generated by runsslserver) to the requests module?

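Try this. Note that verify=False disables certificate verification entirely, so it is only appropriate for local development: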
data = requests.get(myauth_login_links_url, verify=False).json()
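As an alternative to turning verification off, the development certificate can be trusted explicitly. The following is a minimal sketch of that idea using only the Python 3 standard library (ssl and urllib) rather than requests. The helper names make_context and get_json and the cafile path are hypothetical, and where runsslserver keeps its certificate is not stated in the question, so the path has to be supplied by you. With requests, the corresponding option is to pass a certificate or CA bundle path as verify instead of False.

```python
import json
import ssl
import urllib.request


def make_context(cafile=None):
    """Build a verifying SSL context.

    cafile is a hypothetical path to the development certificate (PEM);
    when None, only the system's default CA certificates are trusted.
    """
    return ssl.create_default_context(cafile=cafile)


def get_json(url, cafile=None):
    """Fetch url over HTTPS with verification enabled and decode the JSON body."""
    context = make_context(cafile)
    with urllib.request.urlopen(url, context=context) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling get_json(myauth_login_links_url, cafile="/path/to/development.crt") would then verify the server against that one certificate; the path shown is a placeholder.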

Flask: Unknown error although 200 response code is returned

I have created a simple Flask RESTful API with a single function used to service GET requests and which expects several URL parameters:
import logging
from flask import Flask
from flask_restful import Api, Resource
from webargs import fields
from webargs.flaskparser import abort, parser, use_args
# initialize the Flask application and API
app = Flask(__name__)
api = Api(app)
# ------------------------------------------------------------------------------
# set up a basic, global logger object which will write to the console
logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s",
                    datefmt="%Y-%m-%d %H:%M:%S")
_logger = logging.getLogger(__name__)
# ------------------------------------------------------------------------------
class Abc(Resource):
    abc_args = {"rtsp": fields.Url(required=True),
                "start": fields.Integer(required=True),
                "duration": fields.Integer(required=True),
                "bucket": fields.String(required=True),
                "prefix": fields.String(missing="")}
    @use_args(abc_args)
    def get(self, args) -> (str, int):
        _logger.info("Recording video clip with the following parameters:\n"
                     f"\tRTSP URL: {args['rtsp']}"
                     f"\tStart seconds: {args['start']}"
                     f"\tDuration seconds: {args['duration']}"
                     f"\tS3 bucket: {args['bucket']}"
                     f"\tS3 key prefix: {args['prefix']}")
        return "OK", 200
# ------------------------------------------------------------------------------
# This error handler is necessary for usage with Flask-RESTful
@parser.error_handler
def handle_request_parsing_error(err, req, schema, error_status_code, error_headers):
    """
    webargs error handler that uses Flask-RESTful's abort function to return
    a JSON error response to the client.
    """
    abort(error_status_code, errors=err.messages)
# ------------------------------------------------------------------------------
if __name__ == '__main__':
    api.add_resource(Abc, "/abc", endpoint="abc")
    app.run(debug=True)
When I send a GET request to the endpoint with or without parameters I get none of the expected behavior -- if good parameters are included in the GET request then I expect to see a log message in the console, and if none of the required parameters are present then I expect to get some sort of error as a result. Instead, I get what appear to be 200 response codes in the console and simply the phrase "Unknown Error" in the main browser window.
For example, if I enter the following URL without the expected parameters into my Chrome browser's address bar: http://127.0.0.1:5000/abc
I then see this in the console:
2019-05-28 17:42:14 INFO 127.0.0.1 - - [28/May/2019 17:42:14] "GET /abc HTTP/1.1" 200 -
My assumption is that the above should throw an error of some sort indicating the missing URL parameters.
If I enter the following URL with the expected parameters into my Chrome browser's address bar: http://127.0.0.1:5000/abc?rtsp=rtsp://user:passwd@171.25.14.15:554&start=1559076593&duration=10&bucket=testbucket&prefix=test.
I then see this in the console:
2019-05-28 17:45:31 INFO 127.0.0.1 - - [28/May/2019 17:45:31] "GET /abc?rtsp=rtsp://user:passwd@171.25.14.15:554&start=1559076593&duration=10&bucket=testbucket&prefix=test. HTTP/1.1" 200 -
My assumption is that the above should cause the logger to print the information message to the console as is defined in the Abc.get function.
If I use curl at the command line then I get the following result:
$ curl "http://127.0.0.1:5000/abc?rtsp=rtsp://user:passwd@171.25.14.15:554&start=1559076593&duration=10&bucket=testbucket&prefix=test."
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2 Final//EN">
<title>None Unknown Error</title>
<h1>Unknown Error</h1>
<p></p>
Why is this going amiss, and how can I achieve the expected behavior? (My intention is to use this approach to pass arguments to a more realistic GET handler that will launch a video recording function when the request is received; the above has been simplified as much as possible for clarity.)
I am using Flask 1.0.3, Flask-Restful 0.3.6, and Webargs 5.3.1 in an Anaconda environment (Python 3.7) on Ubuntu 18.04.

I have managed to get the expected behavior using simpler code that does not rely on flask_restful. See below:
import logging
from flask import Flask, request
# initialize the Flask application
app = Flask(__name__)
# ------------------------------------------------------------------------------
# set up a basic, global logger object which will write to the console
logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s",
                    datefmt="%Y-%m-%d %H:%M:%S")
_logger = logging.getLogger(__name__)
# ------------------------------------------------------------------------------
@app.route('/clip', methods=['GET'])
def record_and_store_clip():
    message = "Recording video clip with the following parameters:\n" + \
              f"\tRTSP URL: {request.args.get('rtsp')}\n" + \
              f"\tStart seconds: {request.args.get('start')}\n" + \
              f"\tDuration seconds: {request.args.get('duration')}\n" + \
              f"\tS3 bucket: {request.args.get('bucket')}\n" + \
              f"\tS3 key prefix: {request.args.get('prefix')}\n"
    _logger.info(message)
    return message
# ------------------------------------------------------------------------------
if __name__ == '__main__':
    app.run(debug=True)
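The simpler version above reads each parameter with request.args.get, which yields None for a missing parameter rather than an error, so the "required parameter" behavior the question asked for is no longer enforced. Here is a minimal sketch of such a check using only the Python 3 standard library instead of webargs. The helper name parse_clip_args and the REQUIRED tuple are hypothetical; the parameter names come from the question.

```python
from urllib.parse import parse_qs

# Parameters the question marks as required; "prefix" is optional there.
REQUIRED = ("rtsp", "start", "duration", "bucket")


def parse_clip_args(query_string):
    """Validate a raw query string and return the typed arguments.

    Raises ValueError naming every missing required parameter, and also
    raises ValueError (from int) if start or duration is not an integer.
    """
    raw = parse_qs(query_string)
    missing = [name for name in REQUIRED if name not in raw]
    if missing:
        raise ValueError("missing required parameters: " + ", ".join(missing))
    return {
        "rtsp": raw["rtsp"][0],
        "start": int(raw["start"][0]),
        "duration": int(raw["duration"][0]),
        "bucket": raw["bucket"][0],
        "prefix": raw.get("prefix", [""])[0],
    }


if __name__ == "__main__":
    print(parse_clip_args(
        "rtsp=rtsp://171.25.14.15:554&start=1559076593&duration=10&bucket=testbucket"))
```

Inside the Flask view, the raw query string is available as request.query_string (bytes, so it needs decoding first), and a caught ValueError could be turned into a 400 response.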

Scrapy parse iframe url

I am extracting the links from a website, then trying to parse each linked page for the iframe src.
According to the DEBUG output, the first links are being parsed correctly, but I am not getting any data in my output file.
Is it also possible to remove everything after the ? in the URL? This
looks like embedded iframe information.
I am running CentOS 6.5 with Python 2.7.5.
scrapy runspider new.py -o videos.csv
import scrapy
class PdgaSpider(scrapy.Spider):
    name = "pdgavideos"
    start_urls = ["http://www.pdga.com/videos/"]
    def parse(self, response):
        for link in response.xpath('//td[2]/a/@href').extract():
            from scrapy.http.request import Request
            yield Request('http://www.pdga.com'+link, callback=self.parse_page, meta={'link':link})
    def parse_page(self, response):
        for frame in response.xpath("//player").extract():
            yield {
                'link': response.urljoin(frame)
            }
Debug results
DEBUG: Crawled (200) <GET http://www.pdga.com/videos/2017-gbo-final-round-front-9-sexton-mcbeth-mccray-newhouse> (referer: http://www.pdga.com/videos/)
DEBUG: Crawled (200) <GET http://www.pdga.com/videos/2017-glass-blown-open-fpo-rd-2-pt-1-pierce-fajkus-leatherman-c-allen-sexton-leatherman> (referer: http://www.pdga.com/videos/)
DEBUG: Crawled (200) <GET http://www.pdga.com/videos/2017-gbo-final-round-back-9-sexton-mcbeth-mccray-newhouse> (referer: http://www.pdga.com/videos/)
Expected results
http://www.youtube.com/embed/tYBF-BaqVJ8

Scrapy does not scrape the content of iframes, but you can get it. First get the iframe URL, then request that URL and parse the response.
urls = response.css('iframe::attr(src)').extract()
for url in urls:
    yield scrapy.Request(url....)
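For the second part of the question, removing everything after the ? in the URL, a minimal sketch with the standard library follows. The helper name strip_query and the ?rel=0 suffix in the example are hypothetical; the base URL is the expected result quoted in the question. The import shown is for Python 3; on Python 2.7 the same two functions live in the urlparse module.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query(url):
    """Return url without its query string and fragment."""
    parts = urlsplit(url)
    # Keep scheme, host and path; blank out the query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


if __name__ == "__main__":
    print(strip_query("http://www.youtube.com/embed/tYBF-BaqVJ8?rel=0"))
    # prints http://www.youtube.com/embed/tYBF-BaqVJ8
```

In the spider this could be applied to each extracted src value before it is yielded.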

HTTP Deadline exceeded waiting for python Google Cloud Endpoints on python client localhost

I want to build a python client to talk to my python Google Cloud Endpoints API. My simple HelloWorld example is suffering from an HTTPException in the python client and I can't figure out why.
I've set up simple examples as suggested in this extremely helpful thread. The GAE Endpoints API is running on localhost:8080 with no problems, and I can successfully access it in the API Explorer. Before I added the offending service = build() line, my simple client ran fine on localhost:8080.
When trying to get the client to talk to the endpoints API, I get the following error:
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/dist27/gae_override/httplib.py", line 526, in getresponse
raise HTTPException(str(e))
HTTPException: Deadline exceeded while waiting for HTTP response from URL: http://localhost:8080/_ah/api/discovery/v1/apis/helloworldendpoints/v1/rest?userIp=%3A%3A1
I've tried extending the http deadline. Not only did that not help, but such a simple first call on localhost should not be exceeding a default 5s deadline. I've also tried accessing the discovery URL directly within a browser and that works fine, too.
Here is my simple code. First the client, main.py:
import webapp2
import os
import httplib2
from apiclient.discovery import build
http = httplib2.Http()
# HTTPException happens on the following line:
# Note that I am using http, not https
service = build("helloworldendpoints", "v1", http=http,
                discoveryServiceUrl=("http://localhost:8080/_ah/api/discovery/v1/apis/{api}/{apiVersion}/rest"))
# result = service.resource().method([parameters]).execute()
class MainPage(webapp2.RequestHandler):
    def get(self):
        self.response.headers['Content-type'] = 'text/plain'
        self.response.out.write("Hey, this is working!")
app = webapp2.WSGIApplication(
    [('/', MainPage)],
    debug=True)
Here's the Hello World endpoint, helloworld.py:
"""Hello World API implemented using Google Cloud Endpoints.
Contains declarations of endpoint, endpoint methods,
as well as the ProtoRPC message class and container required
for endpoint method definition.
"""
import endpoints
from protorpc import messages
from protorpc import message_types
from protorpc import remote
# If the request contains path or querystring arguments,
# you cannot use a simple Message class.
# Instead, you must use a ResourceContainer class
REQUEST_CONTAINER = endpoints.ResourceContainer(
    message_types.VoidMessage,
    name=messages.StringField(1),
)
package = 'Hello'
class Hello(messages.Message):
    """String that stores a message."""
    greeting = messages.StringField(1)
@endpoints.api(name='helloworldendpoints', version='v1')
class HelloWorldApi(remote.Service):
    """Helloworld API v1."""
    @endpoints.method(message_types.VoidMessage, Hello,
                      path = "sayHello", http_method='GET', name = "sayHello")
    def say_hello(self, request):
        return Hello(greeting="Hello World")
    @endpoints.method(REQUEST_CONTAINER, Hello,
                      path = "sayHelloByName", http_method='GET', name = "sayHelloByName")
    def say_hello_by_name(self, request):
        greet = "Hello {}".format(request.name)
        return Hello(greeting=greet)
api = endpoints.api_server([HelloWorldApi])
Finally, here is my app.yaml file:
application: <<my web client id removed for stack overflow>>
version: 1
runtime: python27
api_version: 1
threadsafe: yes
handlers:
- url: /_ah/spi/.*
  script: helloworld.api
  secure: always
# catchall - must come last!
- url: /.*
  script: main.app
  secure: always
libraries:
- name: endpoints
  version: latest
- name: webapp2
  version: latest
Why am I getting an HTTP Deadline Exceeded error, and how do I fix it?

In your main.py you forgot to fill in the variables in your discovery service URL string, or you just copied the code here without them. By the looks of it, you were probably supposed to use the string format method:
"http://localhost:8080/_ah/api/discovery/v1/apis/{api}/{apiVersion}/rest".format(api='helloworldendpoints', apiVersion="v1")
By looking at the logs you'll probably see something like this:
INFO 2015-11-19 18:44:51,562 module.py:794] default: "GET /HTTP/1.1" 500 -
INFO 2015-11-19 18:44:51,595 module.py:794] default: "POST /_ah/spi/BackendService.getApiConfigs HTTP/1.1" 200 3109
INFO 2015-11-19 18:44:52,110 module.py:794] default: "GET /_ah/api/discovery/v1/apis/helloworldendpoints/v1/rest?userIp=127.0.0.1 HTTP/1.1" 200 3719
It's timing out first and then "working".
Move the service discovery request inside the request handler:
class MainPage(webapp2.RequestHandler):
    def get(self):
        service = build("helloworldendpoints", "v1",
                        http=http,
                        discoveryServiceUrl=("http://localhost:8080/_ah/api/discovery/v1/apis/{api}/{apiVersion}/rest")
                        .format(api='helloworldendpoints', apiVersion='v1'))

Uploading video to YouTube and adding it to playlist using YouTube Data API v3 in Python

I wrote a script to upload a video to YouTube using the YouTube Data API v3 in Python, with the help of the example given in Example code.
I also wrote another script to add the uploaded video to a playlist using the same YouTube Data API v3, which can be seen here.
After that I wrote a single script to upload a video and add it to a playlist. In it I took care of authentication and scopes, but I am still getting a permission error. Here is my new script:
#!/usr/bin/python
import httplib
import httplib2
import os
import random
import sys
import time
from apiclient.discovery import build
from apiclient.errors import HttpError
from apiclient.http import MediaFileUpload
from oauth2client.file import Storage
from oauth2client.client import flow_from_clientsecrets
from oauth2client.tools import run
# Explicitly tell the underlying HTTP transport library not to retry, since
# we are handling retry logic ourselves.
httplib2.RETRIES = 1
# Maximum number of times to retry before giving up.
MAX_RETRIES = 10
# Always retry when these exceptions are raised.
RETRIABLE_EXCEPTIONS = (httplib2.HttpLib2Error, IOError, httplib.NotConnected,
                        httplib.IncompleteRead, httplib.ImproperConnectionState,
                        httplib.CannotSendRequest, httplib.CannotSendHeader,
                        httplib.ResponseNotReady, httplib.BadStatusLine)
# Always retry when an apiclient.errors.HttpError with one of these status
# codes is raised.
RETRIABLE_STATUS_CODES = [500, 502, 503, 504]
CLIENT_SECRETS_FILE = "client_secrets.json"
# A limited OAuth 2 access scope that allows for uploading files, but not other
# types of account access.
YOUTUBE_UPLOAD_SCOPE = "https://www.googleapis.com/auth/youtube.upload"
YOUTUBE_API_SERVICE_NAME = "youtube"
YOUTUBE_API_VERSION = "v3"
# Helpful message to display if the CLIENT_SECRETS_FILE is missing.
MISSING_CLIENT_SECRETS_MESSAGE = """
WARNING: Please configure OAuth 2.0
To make this sample run you will need to populate the client_secrets.json file
found at:
%s
with information from the APIs Console
https://code.google.com/apis/console#access
For more information about the client_secrets.json file format, please visit:
https://developers.google.com/api-client-library/python/guide/aaa_client_secrets
""" % os.path.abspath(os.path.join(os.path.dirname(__file__),
                                   CLIENT_SECRETS_FILE))
def get_authenticated_service():
    flow = flow_from_clientsecrets(CLIENT_SECRETS_FILE, scope=YOUTUBE_UPLOAD_SCOPE,
                                   message=MISSING_CLIENT_SECRETS_MESSAGE)
    storage = Storage("%s-oauth2.json" % sys.argv[0])
    credentials = storage.get()
    if credentials is None or credentials.invalid:
        credentials = run(flow, storage)
    return build(YOUTUBE_API_SERVICE_NAME, YOUTUBE_API_VERSION,
                 http=credentials.authorize(httplib2.Http()))
def initialize_upload(title,description,keywords,privacyStatus,file):
    youtube = get_authenticated_service()
    tags = None
    if keywords:
        tags = keywords.split(",")
    insert_request = youtube.videos().insert(
        part="snippet,status",
        body=dict(
            snippet=dict(
                title=title,
                description=description,
                tags=tags,
                categoryId='26'
            ),
            status=dict(
                privacyStatus=privacyStatus
            )
        ),
        # chunksize=-1 means that the entire file will be uploaded in a single
        # HTTP request. (If the upload fails, it will still be retried where it
        # left off.) This is usually a best practice, but if you're using Python
        # older than 2.6 or if you're running on App Engine, you should set the
        # chunksize to something like 1024 * 1024 (1 megabyte).
        media_body=MediaFileUpload(file, chunksize=-1, resumable=True)
    )
    vid=resumable_upload(insert_request)
    #Here I added lines to add video to playlist
    #add_video_to_playlist(youtube,vid,"PL2JW1S4IMwYubm06iDKfDsmWVB-J8funQ")
    #youtube = get_authenticated_service()
    add_video_request=youtube.playlistItems().insert(
        part="snippet",
        body={
            'snippet': {
                'playlistId': "PL2JW1S4IMwYubm06iDKfDsmWVB-J8funQ",
                'resourceId': {
                    'kind': 'youtube#video',
                    'videoId': vid
                }
                #'position': 0
            }
        }
    ).execute()
def resumable_upload(insert_request):
    response = None
    error = None
    retry = 0
    vid=None
    while response is None:
        try:
            print "Uploading file..."
            status, response = insert_request.next_chunk()
            if 'id' in response:
                print "'%s' (video id: %s) was successfully uploaded." % (
                    title, response['id'])
                vid=response['id']
            else:
                exit("The upload failed with an unexpected response: %s" % response)
        except HttpError, e:
            if e.resp.status in RETRIABLE_STATUS_CODES:
                error = "A retriable HTTP error %d occurred:\n%s" % (e.resp.status,
                                                                     e.content)
            else:
                raise
        except RETRIABLE_EXCEPTIONS, e:
            error = "A retriable error occurred: %s" % e
        if error is not None:
            print error
            retry += 1
            if retry > MAX_RETRIES:
                exit("No longer attempting to retry.")
            max_sleep = 2 ** retry
            sleep_seconds = random.random() * max_sleep
            print "Sleeping %f seconds and then retrying..." % sleep_seconds
            time.sleep(sleep_seconds)
    return vid
if __name__ == '__main__':
    title="sample title"
    description="sample description"
    keywords="keyword1,keyword2,keyword3"
    privacyStatus="public"
    file="myfile.mp4"
    vid=initialize_upload(title,description,keywords,privacyStatus,file)
    print 'video ID is :',vid
I am not able to figure out what is wrong. I am getting a permission error, although both scripts work fine independently.
Could anyone help me figure out where I am going wrong, or how to upload a video and add it to a playlist?

I found the answer: the scope is actually different in the two independent scripts.
The scope for uploading is "https://www.googleapis.com/auth/youtube.upload"
The scope for adding to a playlist is "https://www.googleapis.com/auth/youtube"
Because the scopes are different, I had to handle authentication separately.
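A possible alternative to authenticating twice, offered as a sketch and not as what the answer did, is to request both scopes in a single authorization. OAuth 2.0 expresses multiple scopes as one space-delimited string, which the hypothetical helper combined_scope below builds. Passing the result as the scope argument of flow_from_clientsecrets is an assumption to check against your oauth2client version, and any credentials file cached from an earlier run with the narrower scope would likely need to be deleted so that consent is requested again.

```python
# The two scope URLs quoted in the answer above.
YOUTUBE_UPLOAD_SCOPE = "https://www.googleapis.com/auth/youtube.upload"
YOUTUBE_SCOPE = "https://www.googleapis.com/auth/youtube"


def combined_scope(*scopes):
    """Join scopes into one space-delimited string, dropping duplicates
    while keeping the order in which they were given."""
    unique = []
    for scope in scopes:
        if scope not in unique:
            unique.append(scope)
    return " ".join(unique)


# Hypothetical constant to pass as scope= when building the OAuth flow.
YOUTUBE_SCOPES = combined_scope(YOUTUBE_UPLOAD_SCOPE, YOUTUBE_SCOPE)

if __name__ == "__main__":
    print(YOUTUBE_SCOPES)
```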
