Spider not found in scrapyd.schedule - django

I'm trying to schedule a Scrapyd job from Django.
The scheduling code looks like this:
unique_id = str(uuid4())  # create a unique ID
settings = {
    'unique_id': unique_id,  # unique ID for each record in the DB
    'USER_AGENT': 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
}
task = scrapyd.schedule('scrap_lowongan', 'josbid', settings=settings)
However, I'm getting:
scrapyd_api.exceptions.ScrapydResponseError: spider 'josbid' not found
My folder structure is something like this
Bitalisy>
    Bitalisy
    Scraping>
        views.py (Schedule scrapyd from here)
    scrap_lowongan> (scrapy Project)
        scrap_lowongan>
            spider>
                jobsid.py
            settings.py
            pipelines.py
        scrapyd.conf
        scrapy.cfg
Note that I'm using scrapyd.conf because I have two Scrapy projects. The scrapyd.conf:
[scrapyd]
http_port = 6801
Thank you

I found that you must create the client with the port set in scrapyd.conf:
scrapyd = ScrapydAPI('http://localhost:6801')
After restarting scrapyd, it works like a charm. See the documentation for more details.
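When the client and the scrapyd daemon disagree on the port, a quick way to see what the daemon actually knows is to list the project's spiders before scheduling. The sketch below is only an illustration: `schedule_if_known` is a hypothetical helper, and `FakeClient` stands in for a `ScrapydAPI('http://localhost:6801')` instance so the example runs without a server. It assumes the client exposes `list_spiders(project)` and `schedule(project, spider, settings=...)`, as python-scrapyd-api does. The spider name `jobsid` is an assumption taken from the file name in the question.

```python
class FakeClient(object):
    """Stand-in for scrapyd_api.ScrapydAPI so the sketch runs offline."""

    def __init__(self, spiders_by_project):
        self._spiders = spiders_by_project

    def list_spiders(self, project):
        return self._spiders.get(project, [])

    def schedule(self, project, spider, settings=None):
        # A real client returns the job id assigned by scrapyd.
        return "job-for-%s" % spider


def schedule_if_known(client, project, spider, settings=None):
    """Schedule `spider` only if the daemon lists it for `project`."""
    known = client.list_spiders(project)
    if spider not in known:
        raise LookupError(
            "spider %r not deployed in %r; daemon knows: %s"
            % (spider, project, ", ".join(known) or "nothing")
        )
    return client.schedule(project, spider, settings=settings)


# Assumed deployment: one project with one spider named after jobsid.py.
client = FakeClient({"scrap_lowongan": ["jobsid"]})
job_id = schedule_if_known(client, "scrap_lowongan", "jobsid")
```

With a real client, the error message would show whether the daemon on that port has the project deployed at all, and under which spider names.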

Related

Dialogflow: Agent metadata not found for agentId

I'm trying to use Dialogflow's detect_intent in Python and I keep getting:
404 com.google.apps.framework.request.NotFoundException: Agent metadata not found for agentId: ####-####-####-####-####
Here's a snippet of my code:
import os
import google.cloud.dialogflow as dialogflow
from CONFIG import DIALOGFLOW_PROJECT_ID
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = 'credentials/dialogflow.json'
def predict_intent(text, language):
    session_client = dialogflow.SessionsClient()
    # SESSION_ID is defined elsewhere in the application (not shown in this snippet)
    session = session_client.session_path(DIALOGFLOW_PROJECT_ID, SESSION_ID)
    text_input = dialogflow.TextInput(text=text, language_code=language)
    query_input = dialogflow.QueryInput(text=text_input)
    response = session_client.detect_intent(session=session, query_input=query_input)  # ERROR
    return response.query_result.intent.display_name
I tried running the function multiple times. Some calls succeed, but most raise the exception.
I can train the bot using the same interface and it works fine.
I'm using Python 3.7 and the following Google Cloud modules: google-api-core==2.0.1, google-auth==2.0.2, google-cloud-dialogflow==2.7.1, googleapis-common-protos==1.53.0.

HTTP Deadline exceeded waiting for python Google Cloud Endpoints on python client localhost

I want to build a python client to talk to my python Google Cloud Endpoints API. My simple HelloWorld example is suffering from an HTTPException in the python client and I can't figure out why.
I've set up simple examples as suggested in this extremely helpful thread. The GAE Endpoints API is running on localhost:8080 with no problems, and I can successfully access it in the API Explorer. Before I added the offending service = build() line, my simple client ran fine on localhost:8080.
When trying to get the client to talk to the endpoints API, I get the following error:
File "/Applications/GoogleAppEngineLauncher.app/Contents/Resources/GoogleAppEngine-default.bundle/Contents/Resources/google_appengine/google/appengine/dist27/gae_override/httplib.py", line 526, in getresponse
raise HTTPException(str(e))
HTTPException: Deadline exceeded while waiting for HTTP response from URL: http://localhost:8080/_ah/api/discovery/v1/apis/helloworldendpoints/v1/rest?userIp=%3A%3A1
I've tried extending the http deadline. Not only did that not help, but such a simple first call on localhost should not be exceeding a default 5s deadline. I've also tried accessing the discovery URL directly within a browser and that works fine, too.
Here is my simple code. First the client, main.py:
import webapp2
import os
import httplib2
from apiclient.discovery import build
http = httplib2.Http()
# HTTPException happens on the following line:
# Note that I am using http, not https
service = build("helloworldendpoints", "v1", http=http,
                discoveryServiceUrl=("http://localhost:8080/_ah/api/discovery/v1/apis/{api}/{apiVersion}/rest"))
# result = service.resource().method([parameters]).execute()
class MainPage(webapp2.RequestHandler):
    def get(self):
        self.response.headers['Content-type'] = 'text/plain'
        self.response.out.write("Hey, this is working!")
app = webapp2.WSGIApplication(
    [('/', MainPage)],
    debug=True)
Here's the Hello World endpoint, helloworld.py:
"""Hello World API implemented using Google Cloud Endpoints.
Contains declarations of endpoint, endpoint methods,
as well as the ProtoRPC message class and container required
for endpoint method definition.
"""
import endpoints
from protorpc import messages
from protorpc import message_types
from protorpc import remote
# If the request contains path or querystring arguments,
# you cannot use a simple Message class.
# Instead, you must use a ResourceContainer class
REQUEST_CONTAINER = endpoints.ResourceContainer(
    message_types.VoidMessage,
    name=messages.StringField(1),
)
package = 'Hello'
class Hello(messages.Message):
    """String that stores a message."""
    greeting = messages.StringField(1)
@endpoints.api(name='helloworldendpoints', version='v1')
class HelloWorldApi(remote.Service):
    """Helloworld API v1."""
    @endpoints.method(message_types.VoidMessage, Hello,
                      path="sayHello", http_method='GET', name="sayHello")
    def say_hello(self, request):
        return Hello(greeting="Hello World")
    @endpoints.method(REQUEST_CONTAINER, Hello,
                      path="sayHelloByName", http_method='GET', name="sayHelloByName")
    def say_hello_by_name(self, request):
        greet = "Hello {}".format(request.name)
        return Hello(greeting=greet)
api = endpoints.api_server([HelloWorldApi])
Finally, here is my app.yaml file:
application: <<my web client id removed for stack overflow>>
version: 1
runtime: python27
api_version: 1
threadsafe: yes
handlers:
- url: /_ah/spi/.*
  script: helloworld.api
  secure: always
# catchall - must come last!
- url: /.*
  script: main.app
  secure: always
libraries:
- name: endpoints
  version: latest
- name: webapp2
  version: latest
Why am I getting an HTTP Deadline Exceeded error, and how do I fix it?
In your main.py you forgot to fill in the variables in your discovery service URL string, or you just copied the code here without them. By the looks of it, you were probably supposed to use the string format method:
"http://localhost:8080/_ah/api/discovery/v1/apis/{api}/{apiVersion}/rest".format(api='helloworldendpoints', apiVersion="v1")
By looking at the logs you'll probably see something like this:
INFO 2015-11-19 18:44:51,562 module.py:794] default: "GET /HTTP/1.1" 500 -
INFO 2015-11-19 18:44:51,595 module.py:794] default: "POST /_ah/spi/BackendService.getApiConfigs HTTP/1.1" 200 3109
INFO 2015-11-19 18:44:52,110 module.py:794] default: "GET /_ah/api/discovery/v1/apis/helloworldendpoints/v1/rest?userIp=127.0.0.1 HTTP/1.1" 200 3719
It's timing out first and then "working".
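To make the placeholder substitution concrete, here is a small sketch of what `str.format` does to the discovery URL template from the question. `DISCOVERY_TEMPLATE` and `discovery_url` are names introduced for illustration only.

```python
DISCOVERY_TEMPLATE = (
    "http://localhost:8080/_ah/api/discovery/v1/apis/{api}/{apiVersion}/rest"
)


def discovery_url(api, version):
    """Fill the {api} and {apiVersion} placeholders in the template."""
    return DISCOVERY_TEMPLATE.format(api=api, apiVersion=version)


url = discovery_url("helloworldendpoints", "v1")
print(url)
```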
Move the service discovery request inside the request handler:
class MainPage(webapp2.RequestHandler):
    def get(self):
        service = build("helloworldendpoints", "v1",
                        http=http,
                        discoveryServiceUrl=("http://localhost:8080/_ah/api/discovery/v1/apis/{api}/{apiVersion}/rest")
                        .format(api='helloworldendpoints', apiVersion='v1'))
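One way to read this advice: the discovery call should not run while the module is being imported, only once a request is being handled. A minimal sketch of that idea follows, caching the client after the first call so later requests reuse it. `get_service` and `fake_build` are hypothetical names; `fake_build` merely stands in for `apiclient.discovery.build` so the example runs without App Engine.

```python
_service = None   # cached client, created on first use
build_calls = []  # records each discovery call, for illustration only


def fake_build(api, version):
    """Stand-in for apiclient.discovery.build (assumed: same first two arguments)."""
    build_calls.append((api, version))
    return {"api": api, "version": version}


def get_service(build=fake_build):
    """Return the cached service, running discovery only on the first call."""
    global _service
    if _service is None:
        _service = build("helloworldendpoints", "v1")
    return _service


# Defining the functions above triggers no discovery request.
calls_at_import = len(build_calls)

# Each request handler would call get_service(); discovery runs once.
first = get_service()
second = get_service()
```

In the handler from the answer, `service = get_service()` would replace the direct `build(...)` call.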

POST Request to Heroku in Python - 403 Forbidden

I'm learning web scraping and building a simple web app at the moment, and I decided to practice scraping a schedule of classes. Here's a code snippet I'm having trouble with in my application, using Python 2.7.4, Flask, Heroku, BeautifulSoup4, and Requests.
import requests
from bs4 import BeautifulSoup as Soup
url = "https://telebears.berkeley.edu/enrollment-osoc/osc"
code = "26187"
values = dict(_InField1 = "RESTRIC", _InField2 = code, _InField3 = "13D2")
html = requests.post(url, params=values)
soup = Soup(html.content, from_encoding="utf-8")
sp = soup.find_all("div", {"class" : "layout-div"})[2]
print sp.text
This works great locally. It gives me back the string "Computer Science 61A P 001 LEC:" as expected. However, when I tried to run it on Heroku (using heroku run bash and then running python), I got a 403 Forbidden error.
Am I missing some settings on Heroku? At first I thought it's the school settings, but then I was wondering why it works locally without any trouble... Any explanation/suggestion would be really appreciated! Thank you in advance.
I was having a similar issue: the request worked locally but was blocked on Heroku. It looks like some websites block requests coming from Heroku (which runs on AWS servers). To get around this, you can send your requests through a proxy server.
There are several Heroku add-ons for this. I went with Fixie, which has a reasonably sized free tier.
To install:
heroku addons:create fixie:tricycle
Then import into your local environment so you can try locally:
heroku config -s | grep FIXIE_URL >> .env
Then, in your Python file, add a couple of lines:
import os
import requests
from bs4 import BeautifulSoup as Soup
proxyDict = {
    "http": os.environ.get('FIXIE_URL', ''),
    "https": os.environ.get('FIXIE_URL', '')
}
url = "https://telebears.berkeley.edu/enrollment-osoc/osc"
code = "26187"
values = dict(_InField1 = "RESTRIC", _InField2 = code, _InField3 = "13D2")
html = requests.post(url, params=values, proxies=proxyDict)
soup = Soup(html.content, from_encoding="utf-8")
sp = soup.find_all("div", {"class" : "layout-div"})[2]
print sp.text
Docs for Fixie are here:
https://devcenter.heroku.com/articles/fixie
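The proxy dictionary in this answer falls back to empty strings when `FIXIE_URL` is unset. A slightly more defensive variant, sketched here under the assumption that you want to skip the proxy entirely in that case, returns `None` so that `requests` is called without a proxy. `proxies_from_env` is a hypothetical helper, and the URL in the usage lines is a made-up placeholder.

```python
import os


def proxies_from_env(environ=None, var="FIXIE_URL"):
    """Build a requests-style proxies dict from an environment variable.

    Returns None when the variable is missing or empty, so the result
    can be passed straight to the `proxies` argument of requests.
    """
    environ = os.environ if environ is None else environ
    proxy_url = environ.get(var, "")
    if not proxy_url:
        return None
    return {"http": proxy_url, "https": proxy_url}


# Placeholder value; the real one comes from `heroku config`.
with_proxy = proxies_from_env({"FIXIE_URL": "http://user:pass@proxy.example:80"})
without_proxy = proxies_from_env({})
```

The call in the answer would then read `requests.post(url, params=values, proxies=proxies_from_env())`.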

How to limit Django Auth system to specific LDAP group with django_auth_ldap

I am using OpenLDAP and I would like to connect it to Django using django_auth_ldap. Whatever option I am trying to follow, it never works properly and I can't find the correct solution.
Here are the versions of the software used:
OpenLDAP 20423 (LDAPv3)
Django 1.4.1 or 1.4.3 (I tried with both)
django_auth_ldap 1.1.3
My LDAP directory has a user called noc.noc that I am using to do the test.
I updated my settings.py file with the following lines:
import ldap, logging
from django_auth_ldap.config import LDAPSearch, PosixGroupType
.
.
.
logger = logging.getLogger('django_auth_ldap')
logger.addHandler(logging.StreamHandler())
logger.setLevel(logging.DEBUG)
AUTHENTICATION_BACKENDS = (
    'django_auth_ldap.backend.LDAPBackend',
    'django.contrib.auth.backends.ModelBackend',
)
AUTH_LDAP_SERVER_URI = "ldap://ldap.XXX.XX"
AUTH_LDAP_BIND_DN = "cn=<LDAPUSER>,dc=XXX,dc=XX"
AUTH_LDAP_BIND_PASSWORD = "<LDAPPASSWORD>"
AUTH_LDAP_USER_SEARCH = LDAPSearch("dc=XXX,dc=XX",
    ldap.SCOPE_SUBTREE, "(uid=%(user)s)")
AUTH_LDAP_GROUP_TYPE = PosixGroupType(name_attr='cn')
AUTH_LDAP_REQUIRE_GROUP = "cn=<USERGROUP>,ou=Groups,dc=XXX,dc=XX"
AUTH_LDAP_GROUP_SEARCH = LDAPSearch("ou=Groups,dc=XXX,dc=XX",
    ldap.SCOPE_SUBTREE, "(objectClass=posixGroup)"
)
AUTH_LDAP_USER_ATTR_MAP = {
    'first_name': 'givenName',
    'last_name': 'sn',
    'email': 'mail',
}
I get the following log messages when I try to log in as the noc.noc user through my Django interface:
search_s('dc=XXX,dc=XX', 2, '(uid=%(user)s)') returned 1 objects:
cn=noc.noc,ou=users,dc=XXX,dc=XX
cn=noc.noc,ou=users,dc=XXX,dc=XX is not a member of cn=<USERGROUP>,ou=groups,dc=XXX,dc=XX
Authentication failed for noc.noc
If I remove the line:
AUTH_LDAP_REQUIRE_GROUP = "cn=<USERGROUP>,ou=Groups,dc=XXX,dc=XX"
The connection to the interface works, but it also works for any user in the LDAP database, which is not what I am looking for.
I also checked that the user exists in the LDAP database and belongs to the correct group with the following command:
ldapsearch -h 'ldap.XXX.XX' -D 'cn=<LDAPUSER>,dc=XXX,dc=XX' -w '<LDAPPASSWORD>' -b 'cn=<USERGROUP>,ou=Groups,dc=XXX,dc=XX'
And the result is:
# extended LDIF
#
# LDAPv3
# base <cn=<USERGROUP>,ou=Groups,dc=XXX,dc=XX> with scope subtree
# filter: (objectclass=*)
# requesting: ALL
#
# <USERGROUP>, Groups, XXX.XX
dn: cn=<USERGROUP>,ou=Groups,dc=XXX,dc=XX
cn: <USERGROUP>
gidNumber: 501
memberUid: cn=user1,ou=Users,dc=XXX,dc=XX
memberUid: cn=noc.noc,ou=Users,dc=XXX,dc=XX
objectClass: posixGroup
# search result
search: 2
result: 0 Success
# numResponses: 2
# numEntries: 1
I tried other parameters in settings.py and looked for similar problems on the web, but none of the solutions I found solved my problem.
Thanks a lot for your help.
With PosixGroupType, django-auth-ldap identifies a group member by the user's uid rather than by the DN. The solution is therefore to replace:
cn=noc.noc,ou=Users,dc=XXX,dc=XX
in the memberUid field of the group in LDAP with:
noc.noc
This solves the issue.
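Why the original entries failed can be shown with a simplified model of the membership check. This is not django-auth-ldap's actual code, only a sketch that assumes the check compares the user's uid with each memberUid value; `is_posix_group_member` is a hypothetical name.

```python
def is_posix_group_member(user_uid, member_uids):
    """Simplified posixGroup check: exact match of the uid against memberUid values."""
    return user_uid in member_uids


# memberUid values as stored in the question's directory (full DNs).
dn_style = [
    "cn=user1,ou=Users,dc=XXX,dc=XX",
    "cn=noc.noc,ou=Users,dc=XXX,dc=XX",
]
# memberUid values after the fix (bare uids); "user1" is assumed by analogy.
uid_style = ["user1", "noc.noc"]

before_fix = is_posix_group_member("noc.noc", dn_style)
after_fix = is_posix_group_member("noc.noc", uid_style)
print(before_fix, after_fix)
```

Under this model, a DN stored in memberUid never equals the bare uid, which matches the "is not a member of" log line in the question.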
Another solution is to modify the django-auth-ldap source. To do so, edit the file config.py (found in /usr/local/lib/python2.6/dist-packages/django-auth-ldap on my installation) and find the method:
def user_groups(self, ldap_user, group_search):
of the class:
class PosixGroupType(LDAPGroupType):
then replace the line:
user_uid = ldap_user.attrs['uid'][0]
by:
user_uid = ldap_user.dn
This should also do the trick.

Working with django : Proxy setup

I have a local Django development setup with Apache. The problem is that the deployment server has no proxy, while at my workplace I work behind an HTTP proxy, so the request calls fail.
Is there any way to make all calls from the requests library go through the proxy? (I know how to add a proxy to individual calls using the proxies parameter, but is there a global solution?)
I got the same error that AmrFouad reported. I finally fixed it by updating wsgi.py as follows:
os.environ['http_proxy'] = "http://proxy.xxx:8080"
os.environ['https_proxy'] = "http://proxy.xxx:8080"
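Setting these two variables acts as a global switch because HTTP clients that honor the environment read them when a request is made; `requests` does so by default. The sketch below uses only the standard library's `urllib.request.getproxies_environment()` to show what such a client would see. The proxy address is the placeholder from the answer, not a real host.

```python
import os
from urllib.request import getproxies_environment

# Placeholder proxy address, as in the answer above.
os.environ["http_proxy"] = "http://proxy.xxx:8080"
os.environ["https_proxy"] = "http://proxy.xxx:8080"

# Collects every *_proxy environment variable into a {scheme: url} dict.
detected = getproxies_environment()
print(detected.get("http"), detected.get("https"))
```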
Add the following lines to your wsgi file.
import json
import os
http_proxy = "10.10.1.10:3128"
https_proxy = "10.10.1.11:1080"
ftp_proxy = "10.10.1.10:3128"
proxyDict = {
    "http": http_proxy,
    "https": https_proxy,
    "ftp": ftp_proxy
}
# os.environ only accepts strings, so store the dict as a JSON string
os.environ["PROXIES"] = json.dumps(proxyDict)
Now you can use this environment variable anywhere you want:
# decode the JSON string back into a dict before passing it to requests
r = requests.get(url, headers=headers, proxies=json.loads(os.environ.get("PROXIES", "{}")))
P.S. - You should have a look at following links
Official Python Documentation for Environment Variables
Where and how do I set an environmental variable using mod-wsgi and django?
Python ENVIRONMENT variables
UPDATE 1
You can do something like the following so that the proxy settings are only used on your local machine.
import json
import socket
# gethostname() returns this machine's hostname; replace "localhost" with the name of your development machine
if socket.gethostname() == "localhost":
    # only on the local server: store the proxy settings (os.environ needs a string, hence JSON)
    os.environ["PROXIES"] = json.dumps(proxyDict)
else:
    # on other hosts: store an empty dictionary
    os.environ["PROXIES"] = json.dumps({})