Why does urllib2 not take my URL? - python-2.7

I have URLs stored in a list that are supposed to be passed to urllib2. However, urllib2 doesn't seem to like this very much and I just cannot see why!
Here is what I have got:
url = list[1]
response = urllib2.urlopen(url)
html = response.read()
The URL is a Google Maps Directions Web API URL of the kind:
http://maps.googleapis.com/maps/api/directions/json?origin=[origin]&destination=[destination]&waypoints=optimize:true|[waypoint1]|[waypoint2]&sensor=false
Now, if I try to run this, the retrieved html always looks something like this:
{
"routes" : [],
"status" : "INVALID_REQUEST"
}
Indicating that something is wrong with the passed URL. If however, I take the URL and assign it directly, like so:
url = "http://maps.googleapis.com/maps/api/directions/json?origin=[origin]&destination=[destination]&waypoints=optimize:true|[waypoint1]|[waypoint2]&sensor=false"
response = urllib2.urlopen(url)
html = response.read()
The result will happily come through with the (for me) essential end part looking like this:
"warnings" : [],
"waypoint_order" : [ 2, 0, 7, 5, 6, 4, 3, 1 ]
}
],
"status" : "OK"
}
My (hopefully not too stupid) question is therefore: why does urllib2 do its job when the URL is assigned directly but not when it comes from a list?
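One quick way to narrow this down (a hedged debugging sketch, not from the original post) is to print the repr() of the list element and compare it with the literal URL that works; differences such as stray whitespace or non-ASCII characters become visible:
# Hedged debugging sketch: 'url_list' stands in for the list from the question
url_from_list = url_list[1]
print repr(url_from_list)   # stray whitespace or umlauts show up here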

I have finally resolved the issue!
Was pretty easy once I stepped back for a moment and thought about it again:
I did get an answer from Google's Web API, which means that urllib2 did pass something. However, it must have been sending something that didn't make sense to the API backend. Since the URL contained German addresses, I guessed that the problem was being caused by umlauts.
So I simply passed my URL through a function that replaces each umlaut with its non-umlaut alternative, and suddenly everything seemed to work fine.
If anybody comes across a similar issue, here's how I solved it:
# Function for replacing all umlauts
def replaceUmlauts(text):
    dic = {'Ä': 'Ae', 'ä': 'ae', 'Ö': 'Oe', 'ö': 'oe', 'Ü': 'Ue', 'ü': 'ue', 'ß': 'ss'}
    for i, j in dic.iteritems():
        text = text.replace(i, j)
    return text
Then simply use
response = urllib2.urlopen(replaceUmlauts(url))
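For illustration, here is a hedged usage sketch (the German place names below are made up, not from the original post); a more general alternative would be to percent-encode each query value, e.g. with urllib.quote_plus:
# Hedged usage sketch -- the addresses are placeholders
url = ("http://maps.googleapis.com/maps/api/directions/json"
       "?origin=München&destination=Köln&sensor=false")
print replaceUmlauts(url)
# -> ...origin=Muenchen&destination=Koeln&sensor=false

# More general alternative: percent-encode the values instead of rewriting umlauts
import urllib
print urllib.quote_plus("München")   # -> 'M%C3%BCnchen' for a UTF-8 byte string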

Related

Django query param get stripped if there is (+) sign

Whenever I try to get my query string parameter, everything works except that the + sign gets stripped.
Here is the urls file:
urlpatterns = [
    re_path(r'^forecast/(?P<city>[\w|\W]+)/$', weather_service_api_views.getCurrentWeather),
]
Here is the views file:
@api_view(['GET'])
def getCurrentWeather(request, city):
    at = request.GET["at"]
    print(at)
    return JsonResponse({"status": "ok"}, status=200)
So if I hit the server with this URL:
http://192.168.0.5:8282/forecast/Bangladesh/?at=2018-10-14T14:34:40+0100
the output of at is like this:
2018-10-14T14:34:40 0100
Always + sign gets stripped. No other characters get stripped. I have used characters like !, = , - etc.
Since + is a special character, you will have to encode your value. Where to encode? It depends on how you are generating the values for at. Based on your URLs and endpoints, it looks like you are working on a weather app and the at value is generated by JavaScript. You can encode your values with encodeURIComponent:
let at = encodeURIComponent(<your_existing_logic>)
eg:
let at = encodeURIComponent('2018-10-14T14:34:40+0100')
which will return:
'2018-10-14T14%3A34%3A40%2B0100'
then in your backend you can get that value with:
at = request.GET.get('at')
it will give you the desired value, 2018-10-14T14:34:40+0100 in this case.
If you are creating your at param in your backend, then there are multiple ways to achieve that. You can look into this solution:
How to percent-encode URL parameters in Python?
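For completeness, a hedged sketch of the backend-side option (assuming Python 3, since re_path implies Django 2+); quote with safe='' percent-encodes the + and : characters as well:
# Hedged sketch: percent-encode the timestamp before building the query string
# (Python 3 assumed; the timestamp value is just an example)
from urllib.parse import quote

at = "2018-10-14T14:34:40+0100"
encoded = quote(at, safe="")   # -> '2018-10-14T14%3A34%3A40%2B0100'
url = "http://192.168.0.5:8282/forecast/Bangladesh/?at=" + encoded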

Obfuscating a string that can afterwards be used in a web address

I have a small simple server in Flask, and I'd like to be able to route to a user's page with the following route:
@app.route("/something/<string:username>", methods=["GET"])
When it's a clear username it's not a problem; however, I want to add simple obfuscation so that, given a key, it produces a new string that can still be used in a web address.
I tried my luck with several methods I found in Stack Overflow, but the output strings have various issues like non-ASCII characters, or characters that give me issues in the routing (like having a / which confuses Flask).
Ideally I'd like to have two functions, obfuscate(key, string) and deobfuscate(key, string) so I'll be able to use like so:
@app.route("/something/<string:username>", methods=["GET"])
def user_page(username):
    # username is an obfuscated string
    clear_username = deobfuscate(MY_KEY, username)
    return flask.make_response("Hi {}".format(clear_username), 200)

...

def create_user(username):
    # username is a clear string
    save_to_database(username)
    return obfuscate(MY_KEY, username)
To summarize, the obfuscation needs to be simple but good enough that you won't be able to figure it out by looking at the URL, and two-way so that I can figure out what the original string was and print it out.
I ended up solving the issue with itsdangerous, which is a dependency of Flask so I have it on my server anyway.
As the example in the itsdangerous documentation shows:
>>> from itsdangerous import URLSafeSerializer
>>> s = URLSafeSerializer('secret-key')
>>> s.dumps([1, 2, 3, 4])
'WzEsMiwzLDRd.wSPHqC0gR7VUqivlSukJ0IeTDgo'
>>> s.loads('WzEsMiwzLDRd.wSPHqC0gR7VUqivlSukJ0IeTDgo')
[1, 2, 3, 4]
It's safe to assume I won't have any surprises as the docstring says:
Works like :class:Serializer but dumps and loads into a URL safe string consisting of the upper and lowercase character of the alphabet as well as _, - and ..
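A hedged sketch of how this could be wrapped into the obfuscate/deobfuscate pair from the question (the key is a placeholder); note that dumps only signs and base64-encodes the value, so the result is URL-safe and tamper-evident rather than strongly hidden:
# Hedged sketch wrapping itsdangerous into the two helpers from the question
from itsdangerous import URLSafeSerializer

MY_KEY = "replace-with-a-real-secret"   # placeholder

def obfuscate(key, string):
    # returns a URL-safe token, e.g. 'ImFsaWNlIg.xxxxx'
    return URLSafeSerializer(key).dumps(string)

def deobfuscate(key, token):
    # raises itsdangerous.BadSignature if the token was tampered with
    return URLSafeSerializer(key).loads(token)

token = obfuscate(MY_KEY, "alice")
assert deobfuscate(MY_KEY, token) == "alice"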

Scrapy webcrawler gets caught in infinite loop, despite initially working.

Alright, so I'm working on a Scrapy-based web crawler with some simple functionality. The bot is supposed to go from page to page, parsing and then downloading. I've gotten the parser to work and I've gotten the downloading to work, but I can't get the crawling to work. I've read the documentation on the Spider class and on how parse is supposed to work. I've tried returning vs. yielding, and I'm still nowhere. I have no idea where my code is going wrong. What seems to happen, based on a debug script I wrote, is the following: the code will run, it will grab page 1 just fine, it'll get the link to page two, it'll go to page two, and then it will happily stay on page two, not grabbing page three at all. I don't know where the mistake in my code is, or how to alter it to fix it, so any help would be appreciated. I'm sure the mistake is basic, but I can't figure out what's going on.
import scrapy

class ParadiseSpider(scrapy.Spider):
    name = "testcrawl2"
    start_urls = [
        "http://forums.somethingawful.com/showthread.php?threadid=3755369&pagenumber=1",
    ]

    def __init__(self):
        self.found = 0
        self.goto = "no"

    def parse(self, response):
        urlthing = response.xpath("//a[@title='Next page']").extract()
        urlthing = urlthing.pop()
        newurl = urlthing.split()
        print newurl
        url = newurl[1]
        url = url.replace("href=", "")
        url = url.replace('"', "")
        url = "http://forums.somethingawful.com/" + url
        print url
        self.goto = url
        return scrapy.Request(self.goto, callback=self.parse_save, dont_filter=True)

    def parse_save(self, response):
        nfound = str(self.found)
        print "Testing" + nfound
        self.found = self.found + 1
        return scrapy.Request(self.goto, callback=self.parse, dont_filter=True)
Use Scrapy's rule engine so that you don't need to write the next-page crawling code in the parse function. Just pass the XPath for the next page in restrict_xpaths and the parse function will get the response of each crawled page:
rules = (Rule(LinkExtractor(restrict_xpaths=['//a[contains(text(), "Next")]']), follow=True),)

def parse(self, response):
    print response.url
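Those fragments only make sense inside a CrawlSpider subclass, and CrawlSpider reserves parse() for its own link-following logic, so a hedged sketch of how the pieces fit together (class and callback names are illustrative) might look like this:
# Hedged sketch, assuming Scrapy 1.x on Python 2.7; names are illustrative
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ParadiseCrawlSpider(CrawlSpider):
    name = "testcrawl3"
    start_urls = [
        "http://forums.somethingawful.com/showthread.php?threadid=3755369&pagenumber=1",
    ]
    # Follow every "Next page" link automatically and hand each response
    # to parse_page; do not override parse() on a CrawlSpider.
    rules = (
        Rule(LinkExtractor(restrict_xpaths=['//a[@title="Next page"]']),
             callback='parse_page', follow=True),
    )

    def parse_page(self, response):
        print response.url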

Problems Scraping a Page With Beautiful Soup

I am using Beautiful Soup to try and scrape a page.
I am trying to follow this tutorial.
I am trying to get the contents of the following page after submitting a Stock Ticker Symbol:
http://www.cboe.com/delayedquote/quotetable.aspx
The tutorial is for a page with a "GET" method, whereas my page uses "POST". I wonder if that is part of the problem.
I want to use the first text box, under where it says:
“Enter a Stock or Index symbol below for delayed quotes.”
Relevant code:
import urllib
import urllib2

user_agent = 'Mozilla/5 (Solaris 10) Gecko'
headers = {'User-Agent': user_agent}
values = {'ctl00$ctl00$AllContent$ContentMain$ucQuoteTableCtl$txtSymbol': 'IBM'}
data = urllib.urlencode(values)
request = urllib2.Request("http://www.cboe.com/delayedquote/quotetable.aspx", data, headers)
response = urllib2.urlopen(request)
The call does not fail, but I do not get the set of options and prices returned to me like when I run the page interactively; I just get a bunch of garbled HTML.
Ok - I think I figured out the problem (and found another). I decided to switch to 'mechanize' from 'urllib2'. Unfortunately, I kept having problems getting the data. Finally, I realized that there are two 'submit' buttons, so I tried passing the name parameter when submitting the form. That did the trick as far as getting the correct response.
However, the next problem was that I could not get BeautifulSoup to parse the HTML and find the necessary tags. A brief Google search revealed others having similar problems. So, I gave up on BeautifulSoup and just did a basic regex on the HTML. Not as elegant as BeautifulSoup, but effective.
Ok - enough speechifying. Here's what I came up with:
import mechanize
import re
br = mechanize.Browser()
url = 'http://www.cboe.com/delayedquote/quotetable.aspx'
br.open(url)
br.select_form(name='aspnetForm')
br['ctl00$ctl00$AllContent$ContentMain$ucQuoteTableCtl$txtSymbol'] = 'IBM'
# here's the key step that was causing the trouble - pass the name parameter
# for the button when calling submit
response = br.submit(name="ctl00$ctl00$AllContent$ContentMain$ucQuoteTableCtl$btnSubmit")
data = response.read()
match = re.search( r'Bid</font><span> \s*([0-9]{1,4}\.[0-9]{2})', data, re.MULTILINE|re.M|re.I)
if match:
    print match.group(1)
else:
    print "There was a problem retrieving the quote"

Django: grabbing parameters

I'm having the hardest time with what should be super simple. I can't grab the passed parameters in Django.
In the browser I type:
http://localhost:8000/mysite/getst/?term=hello
My url pattern is:
(r'^mysite/getst/$', 'tube.views.getsearchterms')
My View is
def getsearchterms(request):
    my_term = some_way_to_get_term
    return HttpResponse(my_term)
In this case it should return "hello". I am calling the view, but a blank value is returned to me. I've tried various forms of GET....
What should some_way_to_get_term be?
The GET parameters can be accessed like any dictionary:
my_term = request.GET['term']                         # raises a KeyError if 'term' is missing
my_term = request.GET.get('term', 'my default term')  # falls back to the default instead
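Put together, a minimal version of the view could look like this (a sketch; the empty-string default is just one choice):
# Minimal sketch of the full view using request.GET
from django.http import HttpResponse

def getsearchterms(request):
    my_term = request.GET.get('term', '')   # '' when ?term=... is absent
    return HttpResponse(my_term)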
By using arbitrary arguments after ? and then catching them with request.GET['term'], you're missing one of the best features of Django's urls module: a consistent URL scheme.
If "term" is always present in this URL call, it must be meaningful to your application, so your url rule could look like:
(r'^mysite/getst/(?P<term>[a-z-.]+)/', 'tube.views.getsearchterms')
That means:
That you've got a more SEO-friendly and stable URL scheme (no ?term=this&q=that inside)
That you can catch your argument easily in your view, like this:
def getsearchterms(request, term):
    # do whatever you want with the variable term
    print term
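For example (a sketch, with an illustrative return value), a request to /mysite/getst/hello/ matches the rule above and Django calls the view with term='hello':
# Sketch: /mysite/getst/hello/ arrives here with term='hello'
from django.http import HttpResponse

def getsearchterms(request, term):
    return HttpResponse(term)   # responds with "hello"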