Trouble with urllib calls in Python. Getting server error - python-2.7

I am trying to download an XML file from the Eurostat website, but I am having trouble doing it with urllib in Python. When I use my regular Chrome browser, the HTTP request succeeds and the website generates an XML file, but when I try the same thing in Python I get a server error. This is the code I am using:
import urllib
from xml.etree import ElementTree as ET
response = urllib.urlopen("http://ec.europa.eu/eurostat/SDMX/diss-web/rest/data/lfsq_egais/Q.T.Y_GE15.EMP..NL")
result = response.read()
print result
I have tried using urllib.urlretrieve too and that didn't work either. Any reason why this might be happening? The HTML I get back is as follows:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Draft//EN">
<HTML>
<HEAD>
<TITLE>Error 500--Internal Server Error</TITLE>
<META NAME="GENERATOR" CONTENT="WebLogic Server">
</HEAD>
<BODY bgcolor="white">
<FONT FACE=Helvetica><BR CLEAR=all>
<TABLE border=0 cellspacing=5><TR><TD><BR CLEAR=all>
<FONT FACE="Helvetica" COLOR="black" SIZE="3"><H2>Error 500--Internal Server Error</H2>
</FONT></TD></TR>
</TABLE>
<TABLE border=0 width=100% cellpadding=10><TR><TD VALIGN=top WIDTH=100% BGCOLOR=white><FONT FACE="Courier New"><FONT FACE="Helvetica" SIZE="3"><H3>From RFC 2068 <i>Hypertext Transfer Protocol -- HTTP/1.1</i>:</H3>
</FONT><FONT FACE="Helvetica" SIZE="3"><H4>10.5.1 500 Internal Server Error</H4>
</FONT><P><FONT FACE="Courier New">The server encountered an unexpected condition which prevented it from fulfilling the request.</FONT></P>
</FONT></TD></TR>
</TABLE>
</BODY>
</HTML>

This question is a few months old now, but better late than never:
The Eurostat REST API you are talking about is supposed to respond with XML content, but urllib does not send an Accept header by default, and the server appears to fail without one. The solution is to add the header Accept: application/xml to the request.
This will do the trick in Python 2.7 (using urllib2 by the way):
import urllib2
req = urllib2.Request("http://ec.europa.eu/eurostat/SDMX/diss-web/rest/data/"
"lfsq_egais/Q.T.Y_GE15.EMP..NL")
req.add_header("Accept", "application/xml")
response = urllib2.urlopen(req)
print response.read()
See urllib2 docs for more info and examples.
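For readers on Python 3, where urllib2 was split into urllib.request and urllib.error, a rough equivalent might look like the sketch below. The helper names build_request and fetch_xml are illustrative and not part of the original answer, and the Eurostat URL is copied from the question, so it may no longer be live.

```python
import urllib.request

# URL taken from the question; it may have moved since then.
EUROSTAT_URL = ("http://ec.europa.eu/eurostat/SDMX/diss-web/rest/data/"
                "lfsq_egais/Q.T.Y_GE15.EMP..NL")


def build_request(url):
    """Return a Request that explicitly asks the server for XML."""
    req = urllib.request.Request(url)
    req.add_header("Accept", "application/xml")
    return req


def fetch_xml(url, timeout=30):
    """Send the request and return the raw response body as bytes."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
        return response.read()
```

Calling fetch_xml(EUROSTAT_URL) performs the actual download; build_request is kept separate so the headers can be inspected without touching the network.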

Related

Google Compute Engine aiohttp get requests recaptcha

I am trying to send a GET request from Google Compute Engine (GCE) to Newegg using aiohttp. When I do, I get back the "Are you a human?" page. However, when I run exactly the same code on my local machine, I am able to retrieve the page just fine. Does anyone know:
I only get the Recaptcha page with GCE, but not my local machine?
Is there any way to avoid or get around this Recaptcha page on GCE?
my code:
import asyncio
from bs4 import BeautifulSoup
import aiohttp
async def myDriver():
    await httpReq()
async def httpReq():
    async with aiohttp.ClientSession() as session:
        async with session.get("https://www.newegg.com/") as page:
            responseCode = page.status
            print(responseCode)
            pageContent = await page.text()
            content = BeautifulSoup(pageContent, 'lxml')
            print(content.prettify())
asyncio.run(myDriver())
page reached:
200
<!DOCTYPE html>
<html lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>
Are you a human?
</title>
.
.
.
grecaptcha.ready(function()
Notes:
"Debian GNU/Linux 10 (buster)"
python 3.7.3
aiohttp 3.6.3
I've tried similar code with the normal requests library on GCE, and everything works fine, so this is only an issue with aiohttp on GCE.
I must use aiohttp and not the normal requests library for my project

Reverse geolocation. Loading api/geocode I get SyntaxError: Unexpected token ':'. Parse error

This is a single script.php that only loads data.
<!doctype html>
<html>
<head>
<meta charset="UTF-8">
<title>TEST</title>
</head>
<body>
<div id="location">
<script src="https://maps.googleapis.com/maps/api/geocode/json?latlng=42.149247222222,24.752305555556&key=My-enabled-key-here">
</script>
</div>
</body>
</html>
In Safari on Mac I get SyntaxError: Unexpected token ':'. Parse error.
The data is loaded (I can see it in the Safari debugger), but I cannot use it because of that error.
In Chrome and Opera I get Cross-Origin Read Blocking (CORB) blocked cross-origin response with MIME type application/json.
Reading some old questions I added
<?php header('Access-Control-Allow-Origin: http://example.com') ?>
and then replaced by
<?php header('Access-Control-Allow-Origin: *') ?>
as the first line but nothing changed.
From Google side: Key restrictions
-> Application restrictions: none.
-> API restrictions: yes (the key is accepted for 4 APIs, one of them being the Geocoding API)
What's wrong here?
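You are getting the Cross-Origin Read Blocking (CORB) error because you are making a Geocoding web service request on the client side (front end). A <script> tag expects JavaScript, so the browser either blocks the JSON response (Chrome, Opera) or tries to run it as a script and fails to parse it (Safari). Web service requests are meant to be executed server-side.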
Note that if you intend to use Geocoding in client-side, the JavaScript API has a Geocoding Service (which prevents the CORB issue). Please refer to this guide: https://developers.google.com/maps/documentation/javascript/geocoding
Hope this helps!
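As a rough illustration of the server-side approach, the sketch below builds the same Geocoding request in Python instead of a <script> tag. The helper names build_geocode_url and reverse_geocode are hypothetical, and the API key is a placeholder you would supply yourself; only the endpoint and the latlng and key parameters come from the question.

```python
import json
import urllib.parse
import urllib.request

# Endpoint copied from the question's <script src=...> URL.
GEOCODE_ENDPOINT = "https://maps.googleapis.com/maps/api/geocode/json"


def build_geocode_url(lat, lng, api_key):
    """Build the reverse-geocoding URL with properly encoded parameters."""
    query = urllib.parse.urlencode({"latlng": f"{lat},{lng}", "key": api_key})
    return f"{GEOCODE_ENDPOINT}?{query}"


def reverse_geocode(lat, lng, api_key, timeout=10):
    """Call the web service from the server and return the decoded JSON."""
    url = build_geocode_url(lat, lng, api_key)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)
```

The page can then receive the decoded result from your own server, so the browser never loads the JSON as a script and the API key stays out of the page source.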

NTLM with Postman shows "JSONError | Unexpected token '<' at 1:1 "

I have a script that does API automation in Postman by fetching data from a CSV file and comparing the JSON response with the data in that file. I have 12 scenarios/iterations to verify; each scenario sends more than 20 values picked from the CSV file and compares more than 10 values from the JSON response. Everything was working fine.
Now a security feature has been implemented in the code, so I have to send the requests and automate the script with an ID/PWD. I used NTLM authentication with the ID and PWD.
When I run the script with the runner, the first two iterations get a correct response and the script passes, but from the 3rd iteration onward every script fails and gets no response. The response says Data unavailable, and when I checked the Postman console it showed the details below.
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"/>
<title>401 - Unauthorized: Access is denied due to invalid credentials.</title>
<style type="text/css">
<!--
body{margin:0;font-size:.7em;font-family:Verdana, Arial, Helvetica, sans-serif;background:#EEEEEE;}
fieldset{padding:0 15px 10px 15px;}
h1{font-size:2.4em;margin:0;color:#FFF;}
h2{font-size:1.7em;margin:0;color:#CC0000;}
h3{font-size:1.2em;margin:10px 0 0 0;color:#000000;}
#header{width:96%;margin:0 0 0 0;padding:6px 2% 6px 2%;font-family:"trebuchet MS", Verdana, sans-serif;color:#FFF;
background-color:#555555;}
#content{margin:0 0 0 2%;position:relative;}
.content-container{background:#FFF;width:96%;margin-top:8px;padding:10px;position:relative;}
-->
</style>
</head>
<body>
<div id="header"><h1>Server Error</h1></div>
<div id="content">
<div class="content-container"><fieldset>
<h2>401 - Unauthorized: Access is denied due to invalid credentials.</h2>
<h3>You do not have permission to view this directory or page using the credentials that you supplied.</h3>
</fieldset></div>
</div>
</body>
</html>
What could be the reason, and is there any way to solve this?
Screenshot
Used NTLM Authentication [BETA] authorization option with ID/PWD
Here are the details for the passing scenario
Request Headers:
content-type:"application/json"
cache-control:"no-cache"
user-agent:"PostmanRuntime/7.1.5"
accept:"*/*"
host:"xxxxxx"
accept-encoding:"gzip, deflate"
content-length:599
authorization:"NTLM TlRMTVNTUAADAAAAGAAYAFIAAAAYABgAagAAAAAAAABIAAAACgAKAEgAAAAAAAAAUgAAAAAAAACCAAAABYKIogUBKAoAAAAPUAAzAFcATABJAPxv7ESeMEwAAAAAAAAAAAAAAAAAAAAAAHZECYztsK+qnjG5K0DvDIPzQ09CFXWo0Q=="
Request Body:
Response Headers:
transfer-encoding:"chunked"
content-type:"application/json; charset=utf-8"
location:"xxxxxx/api/rate/zzz"
server:"Kestrel"
persistent-auth:"true"
date:"Wed, 06 Jun 2018 13:40:05 GMT"
Response Body:
rate:5
retailRateAttributes:
error:null
Here are the details for the failed scenario
Request Headers:
content-type:"application/json"
cache-control:"no-cache"
authorization:"NTLM TlRMTVNTUAADAAAAGAAYAFIAAAAYABgAagAAAAAAAABIAAAACgAKAEgAAAAAAAAAUgAAAAAAAACCAAAABYKIogUBKAoAAAAPUAAzAFcATABJAPxv7ESeMEwAAAAAAAAAAAAAAAAAAAAAAHZECYztsK+qnjG5K0DvDIPzQ09CFXWo0Q=="
user-agent:"PostmanRuntime/7.1.5"
accept:"*/*"
host:"xxxxxx"
accept-encoding:"gzip, deflate"
content-length:599
Request Body:
Response Headers:
content-type:"text/html"
server:"Microsoft-IIS/10.0"
www-authenticate:
0:"Negotiate"
1:"NTLM"
date:"Wed, 06 Jun 2018 13:40:05 GMT"
content-length:"1293"
Response Body:
While Postman errors are not the most descriptive, this error means the response body started with '<', so it was HTML rather than JSON. That typically happens when your API endpoint does not exist. You may want to check that you are calling the appropriate endpoint.
You say that the first two iterations work fine but you get the error when you reach the third iteration. That sounds like the auth/token/session expired.
Today I got the Postman error message:
JSONError: Unexpected token '<' at 1:1
<!doctype html>
^
I realized that the problem (in my case) was that I tried to access an API that I had written myself but had forgotten to upload, so I was calling an API that did not exist.
As soon as I uploaded the API, the error went away.
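All three answers come down to the same symptom: the client tried to parse an HTML error page as JSON. A minimal sketch of guarding against that, written in Python rather than in Postman's own scripting, checks the Content-Type before parsing. The function name parse_json_body is hypothetical, and the sample values mirror the headers shown in the question.

```python
import json


def parse_json_body(status, content_type, body):
    """Return the decoded JSON, or raise ValueError for a non-JSON reply."""
    if "json" not in content_type.lower():
        # An HTML error page (401, 404, 500...) would otherwise surface
        # only as a confusing "unexpected token '<'" parse error.
        snippet = body.lstrip()[:60]
        raise ValueError(
            f"HTTP {status}: expected JSON but got {content_type!r}: {snippet!r}"
        )
    return json.loads(body)
```

With the failed iteration from the question (status 401, content-type text/html), this raises an error that names the status code instead of a parse failure at position 1:1.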

Parse HTML with BeautifulSoup replaces existing HTML tag

I am using BeautifulSoup v4 to parse out a string of HTML that looks like this:
<!DOCTYPE HTML>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office">
<head></head>
<body><p>Hello, world</p></body>
</html>
Here is how I am parsing it:
soup = BeautifulSoup(html)
Where html is the pasted HTML above. For whatever reason, BS keeps replacing the <html> tag with a plain <html> tag without the extra namespace attributes. Is there any way I can tell BS not to do this?
I was able to figure it out by passing in html5lib as the HTML parser to BS. But now it keeps inserting a stray HTML comment in place of the DOCTYPE:
<!--<!DOCTYPE HTML-->
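One thing that may be worth trying is Python's built-in parser, which BeautifulSoup uses when called as BeautifulSoup(html, "html.parser"). The stdlib-only sketch below shows only what that underlying parser reports for the <html> attributes and the DOCTYPE. RootInspector and inspect_root are illustrative names, and whether BeautifulSoup's final output then matches what you need is an assumption to verify.

```python
from html.parser import HTMLParser


class RootInspector(HTMLParser):
    """Record the DOCTYPE declaration and the attributes of <html>."""

    def __init__(self):
        super().__init__()
        self.doctype = None
        self.html_attrs = {}

    def handle_decl(self, decl):
        # Receives the text between "<!" and ">", e.g. "DOCTYPE HTML".
        self.doctype = decl

    def handle_starttag(self, tag, attrs):
        if tag == "html":
            self.html_attrs = dict(attrs)


def inspect_root(markup):
    """Parse markup and return (doctype, attributes of the <html> tag)."""
    parser = RootInspector()
    parser.feed(markup)
    parser.close()
    return parser.doctype, parser.html_attrs
```

Feeding it the HTML from the question returns the DOCTYPE as a declaration rather than a comment, along with the xmlns, xmlns:v and xmlns:o attributes.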

Can I use an <img> tag to send cookies across domains?

Look at this situation:
www.websitea.com displays an img tag with a src attribute of www.websiteb.com/image.aspx?id=5 and style="display:none"
www.websiteb.com returns a clear image, in addition to a cookie with a name of referrer and a value of 5 (created server-side from the validated querystring).
Would the cookie be created on domain www.websitea.com or www.websiteb.com?
Currently I'm using a series of redirects with querystrings to achieve cross-domain cookies, but I came up with this image idea a little while ago. I guess I could also use an iframe.
Thanks!
Check out:
cross-domain-user-tracking
Someone mentions using a 1x1 image for tracking across domains.
The cookie would be created for websiteb.com.
The cookie is created from the request to websiteb.com so yea... the cookie goes to websiteb scope
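To make the scoping concrete, here is a minimal Python sketch of what websiteb.com's image endpoint might send back. The function name pixel_response is hypothetical; the cookie name referrer and the integer id check follow the question. Because the Set-Cookie header arrives in a response from websiteb.com, the browser stores the cookie for that host, not for the page that embedded the image.

```python
import base64
from http.cookies import SimpleCookie

# A 1x1 transparent GIF, decoded once at import time.
TRANSPARENT_GIF = base64.b64decode(
    "R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7"
)


def pixel_response(raw_id):
    """Return (status, headers, body) for the tracking-image request."""
    headers = [
        ("Content-Type", "image/gif"),
        ("Content-Length", str(len(TRANSPARENT_GIF))),
    ]
    # Only set the cookie when the querystring value is a plain integer.
    if raw_id is not None and raw_id.isdigit():
        cookie = SimpleCookie()
        cookie["referrer"] = raw_id
        cookie["referrer"]["path"] = "/"
        headers.append(("Set-Cookie", cookie["referrer"].OutputString()))
    return 200, headers, TRANSPARENT_GIF
```

Note that current browsers often restrict or block third-party cookies, so whether the cookie is actually stored depends on the visitor's browser settings.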
You're on the right track. As others have mentioned, the cookie would be created for websiteb.com.
To overcome issues with IE you'll probably need to add a Compact Privacy Policy (P3P).
Start here: http://msdn.microsoft.com/en-us/library/ms537342.aspx and Google for the rest.
Ok looks good. Tested in all browsers. Added a P3P tag for IE6, not sure if it was necessary though.
<%@ Page Language="VB" %>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<script runat="server">
Protected Sub Page_Load(ByVal sender As Object, ByVal e As System.EventArgs)
Response.AddHeader("P3P", "CP=""CAO PSA OUR""")
Dim passedlocalizeID As String = Request.QueryString("id")
Dim localizeID As Integer
If passedlocalizeID IsNot Nothing AndAlso Int32.TryParse(passedlocalizeID, localizeID) Then
Dim localizer As New Localizer
localizer.LocalizeTo(localizeID)
End If
End Sub
</script>
<html xmlns="http://www.w3.org/1999/xhtml">
<head runat="server">
<title>Redirecting . . .</title>
<meta http-equiv="refresh" content="0;URL=/" />
</head>
<body>
<form id="form1" runat="server">
<div>
</div>
</form>
</body>
</html>