I'm learning web scraping and building a simple web app at the moment, and I decided to practice scraping a schedule of classes. Here's a code snippet I'm having trouble with in my application, using Python 2.7.4, Flask, Heroku, BeautifulSoup4, and Requests.
import requests
from bs4 import BeautifulSoup as Soup
url = "https://telebears.berkeley.edu/enrollment-osoc/osc"
code = "26187"
values = dict(_InField1 = "RESTRIC", _InField2 = code, _InField3 = "13D2")
html = requests.post(url, params=values)
soup = Soup(html.content, from_encoding="utf-8")
sp = soup.find_all("div", {"class" : "layout-div"})[2]
print sp.text
This works great locally. It gives me back the string "Computer Science 61A P 001 LEC:" as expected. However, when I tried to run it on Heroku (using heroku run bash and then running python), I got back a 403 Forbidden error.
Am I missing some settings on Heroku? At first I thought it's the school settings, but then I was wondering why it works locally without any trouble... Any explanation/suggestion would be really appreciated! Thank you in advance.
I was having a similar issue: the request worked locally but was blocked on Heroku. It looks like some websites block requests coming from Heroku (which runs on AWS servers). To get around this, you can send your requests via a proxy server.
There are a bunch of different add-ons in Heroku to achieve this. I went with Fixie, which has a reasonably sized free tier.
To install:
heroku addons:create fixie:tricycle
Then import into your local environment so you can try locally:
heroku config -s | grep FIXIE_URL >> .env
Then, in your Python file, just add a couple of lines:
import os
import requests
from bs4 import BeautifulSoup as Soup
proxyDict = {
    "http" : os.environ.get('FIXIE_URL', ''),
    "https" : os.environ.get('FIXIE_URL', '')
}
url = "https://telebears.berkeley.edu/enrollment-osoc/osc"
code = "26187"
values = dict(_InField1 = "RESTRIC", _InField2 = code, _InField3 = "13D2")
html = requests.post(url, params=values, proxies=proxyDict)
soup = Soup(html.content, from_encoding="utf-8")
sp = soup.find_all("div", {"class" : "layout-div"})[2]
print sp.text
Docs for Fixie are here:
https://devcenter.heroku.com/articles/fixie
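One thing to watch with the snippet above: when FIXIE_URL is not set (for example on a machine without the .env file loaded), the proxy entries become empty strings. A minimal sketch of a helper that falls back to no proxy in that case is below. The name build_proxies and the example URL are made up for illustration; the returned value is meant to be passed as proxies= to requests.post, which also accepts None.

```python
import os


def build_proxies(environ=os.environ):
    """Return a requests-style proxies dict, or None if FIXIE_URL is unset."""
    fixie_url = environ.get("FIXIE_URL")
    if not fixie_url:
        return None  # no proxy configured: let the request go out directly
    return {"http": fixie_url, "https": fixie_url}


# The URL below is a made-up placeholder, not a real Fixie endpoint.
with_proxy = build_proxies({"FIXIE_URL": "http://fixie:token@example.invalid:80"})
without_proxy = build_proxies({})
```

With this, the call would look like requests.post(url, params=values, proxies=build_proxies()), and the same code runs both on Heroku and locally.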
I'm trying to use Dialogflow's detect_intent in Python and I keep getting:
404 com.google.apps.framework.request.NotFoundException: Agent metadata not found for agentId: ####-####-####-####-####
Here's a snippet of my code:
import os
import google.cloud.dialogflow as dialogflow
from CONFIG import DIALOGFLOW_PROJECT_ID
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = 'credentials/dialogflow.json'
def predict_intent(text, language):
    session_client = dialogflow.SessionsClient()
    # SESSION_ID is not defined in this snippet; presumably it is set elsewhere
    session = session_client.session_path(DIALOGFLOW_PROJECT_ID, SESSION_ID)
    text_input = dialogflow.TextInput(text=text, language_code=language)
    query_input = dialogflow.QueryInput(text=text_input)
    response = session_client.detect_intent(session=session, query_input=query_input) # ERROR
    return response.query_result.intent.display_name
I tried running the function multiple times. Some calls succeed, but most raise the exception.
I can train the bot using the same interface and it works fine.
I'm using Python 3.7 and the following Google Cloud modules: google-api-core==2.0.1, google-auth==2.0.2, google-cloud-dialogflow==2.7.1, googleapis-common-protos==1.53.0.
I am using Python 2.7 and the Google+ public API to write activity data to a file. I am having trouble keeping the JSON encoding in the file: strings are written as u'...' instead of with double quotes. Below is my code:
from apiclient import discovery
API_KEY = 'MY API KEY'
service = discovery.build("plus", "v1", developerKey=API_KEY)
activities_resource = service.activities()
request = activities_resource.search(query='India versus South Africa', maxResults=1, orderBy='best',)
while request != None:
    activities_document = request.execute()
    if 'items' in activities_document:
        with open("output.json", mode='a') as file:
            data = str(activities_document['items'])
            file.write(data + "\n\n")
    request = service.activities().list_next(request, activities_document)
Output:
[{u'kind': u'plus#activity', u'provider': {u'title': u'Google+'}, u'titl.......
I am expecting [{"kind": "plus#activity", .....
I am running my code on Windows and have tried both the DOS prompt and the PyCharm IDE. I have also run the code on an Ubuntu machine, with the same output. Please let me know what I am doing wrong.
The json module is used for generating JSON. Use it.
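To make that concrete, here is a small sketch of the difference. The items list is a hand-written stand-in for activities_document['items'] (not real API output), and the example is written for Python 3, although json.dumps is called the same way on 2.7. str() gives the Python repr of the list, which is where the u'' prefixes come from, while json.dumps gives actual JSON.

```python
import json

# Hypothetical stand-in for activities_document['items'] from the question.
items = [{"kind": "plus#activity", "provider": {"title": "Google+"}}]

# str(items) would produce the Python repr (with u'' prefixes on Python 2);
# json.dumps produces real JSON with double-quoted strings.
line = json.dumps(items, sort_keys=True)

# Round-trip check: the serialized text parses back to the same data.
restored = json.loads(line)
```

In the original loop, that would mean replacing str(activities_document['items']) with json.dumps(activities_document['items']) before the file.write call.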
I'm working on a script that checks multiple domain names if they are registered or not. It loops through a list of the domains read from a file, and the registration check is done using enom's API. My problem is the code accessing the API in Python 2:
import urllib2
import xml.etree.ElementTree as ET
...
response = urllib2.urlopen(url)
content = ET.fromstring(response.read())
...
return content.find("RRPCode").text
generates the error: 'NoneType' object has no attribute 'text' in nearly 30% of the checks. While the Python 3 code:
import urllib.request
import xml.etree.ElementTree as ET
...
response = urllib.request.urlopen(url)
content = ET.fromstring(response.read())
...
return content.find("RRPCode").text
works fine. I should also mention that the number of errors is random and not related to specific domain names.
What could be the cause of these errors?
I am by the way using Python 2.7.3 and Python 3.2.3 on a VPS running Ubuntu 12.04 server.
Thanks.
I am facing a bit of a situation,
Scenario: I have a Django REST API running on localhost:8000 and I want to access it from the command line. I have tried the urllib2 and requests libraries to talk to the API, but both failed (I'm getting a 503 error). When I pass google.com as the URL, I get the expected response, so I believe my approach is correct but I'm doing something wrong. Please see the code below:
import urllib, urllib2, httplib
url = 'http://localhost:8000'
httplib.HTTPConnection.debuglevel = 1
print "urllib"
data = urllib.urlopen(url);
print "urllib2"
request = urllib2.Request(url)
opener = urllib2.build_opener()
feeddata = opener.open(request).read()
print "End\n"
Environment:
OS Win7
python v2.7.5
Django==1.6
Markdown==2.3.1
colorconsole==0.6
django-filter==0.7
django-ping==0.2.0
djangorestframework==2.3.10
httplib2==0.8
ipython==1.0.0
jenkinsapi==0.2.14
names==0.3.0
phonenumbers==5.8b1
requests==2.1.0
simplejson==3.3.1
termcolor==1.1.0
virtualenv==1.10.1
Thanks
I had a similar problem, and found that it was the company's proxy that was preventing me from reaching my own machine.
503 Response when trying to use python request on local website
Try:
>>> import requests
>>> session = requests.Session()
>>> session.trust_env = False
>>> r = session.get("http://localhost:5000/")
>>> r
<Response [200]>
>>> r.content
'Hello World!'
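Setting trust_env = False makes the session ignore proxy settings from the environment, so the request to localhost no longer goes through the corporate proxy. If you want the same effect without requests, a rough Python 3 standard-library equivalent is an opener built with an empty ProxyHandler. The sketch below is only an illustration: HelloHandler is a made-up stand-in for the local app, and fetch_without_proxy is a hypothetical helper name.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Tiny stand-in for the local app (hypothetical, for illustration)."""

    def do_GET(self):
        body = b"Hello World!"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet


def fetch_without_proxy(url):
    # An empty ProxyHandler overrides any proxies picked up from the environment.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(url, timeout=5) as response:
        return response.status, response.read()


# Port 0 asks the OS for any free port.
server = HTTPServer(("127.0.0.1", 0), HelloHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

status, body = fetch_without_proxy("http://127.0.0.1:%d/" % server.server_port)
server.shutdown()
server.server_close()
```

Against a real local server you would only need fetch_without_proxy("http://localhost:8000/"); the throwaway server is there so the example runs on its own.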
If you are registering your serializers with DefaultRouter, then your API will appear at
http://localhost:8000/api/ for an HTML view of the index
http://localhost:8000/api/.json for a JSON view of the index
http://localhost:8000/api/appname for an HTML view of the individual resource
http://localhost:8000/api/appname/.json for a JSON view of the individual resource
You can check the response in your browser to make sure your URL is working as you expect.
When I run the server, it shows me the message http://127.0.0.1:8000/. I want to get that URL in code, but I don't have a request object there.
request.build_absolute_uri('/')
or you could try
import socket
socket.gethostbyname(socket.gethostname())
According to the Django docs, you can use:
domain = request.get_host()
# domain = 'localhost:8000'
If you want to know exactly the IP address that the development server is started with, you can use sys.argv for this. The Django development server uses the same trick internally to restart the development server with the same arguments
Start development server:
manage.py runserver 127.0.0.1:8000
Get address in code:
if settings.DEBUG:
    import sys
    print sys.argv[-1]
This prints 127.0.0.1:8000
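Note that sys.argv[-1] only gives the address when it was actually passed as the last argument. A slightly more defensive sketch is below, written for Python 3. The function name runserver_addrport is made up for illustration, and the fallback assumes Django's default runserver address of 127.0.0.1:8000. It does not handle options that take a separate value (such as --settings foo), and if only a port was passed it returns just that port.

```python
DEFAULT_ADDRPORT = "127.0.0.1:8000"  # Django's default runserver address


def runserver_addrport(argv):
    """Best-effort guess at the address passed to `manage.py runserver`."""
    if "runserver" not in argv:
        return None  # some other management command is running
    # Positional arguments after "runserver", ignoring options such as --noreload.
    start = argv.index("runserver") + 1
    positional = [arg for arg in argv[start:] if not arg.startswith("-")]
    return positional[0] if positional else DEFAULT_ADDRPORT


explicit = runserver_addrport(["manage.py", "runserver", "127.0.0.1:8000"])
implicit = runserver_addrport(["manage.py", "runserver", "--noreload"])
other = runserver_addrport(["manage.py", "migrate"])
```

In a real project you would call it as runserver_addrport(sys.argv).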
import socket
host = socket.gethostname()
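import re
import subprocess
def get_ip_machine():
    # Run ifconfig and capture its output as text.
    process = subprocess.Popen(['ifconfig'], stdout=subprocess.PIPE)
    output = process.communicate()[0].decode()
    # Match the first dotted-quad IPv4 address in the output.
    ip_regex = re.compile(r'(((1?[0-9]{1,2})|(2[0-4]\d)|(25[0-5]))\.){3}((1?[0-9]{1,2})|(2[0-4]\d)|(25[0-5]))')
    # A compiled pattern's search() takes a start position, not flags, so re.MULTILINE is dropped.
    return ip_regex.search(output).group()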