Problems using extended escape mode for jsoup output - html-entities

I need to transform an HTML file by removing certain tags from it. To do this I have something like this:
import org.jsoup.Jsoup;
import org.jsoup.helper.Validate;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Entities;
import org.jsoup.nodes.Entities.EscapeMode;

import java.io.File;
import java.io.IOException;
import java.util.*;

public class TestJsoup {
    public static void main(String[] args) throws IOException {
        Validate.isTrue(args.length == 1, "usage: supply url to fetch");
        String url = args[0];

        Document doc;
        if (url.contains("http")) {
            doc = Jsoup.connect(url).get();
        } else {
            File f = new File(url);
            // null charset: detect from the file's meta tags, falling back to UTF-8
            doc = Jsoup.parse(f, null);
        }

        /* remove some tags */
        doc.outputSettings().escapeMode(Entities.EscapeMode.extended);
        System.out.println(doc.html());
    }
}
The issue with the above code is that when I use the extended escape mode, the HTML tag attributes in the output are also HTML-encoded. Is there any way to avoid this? Using the base or xhtml escape mode doesn't work, as some of the non-standard extended characters (like ’) cause encoding problems. For example, for the HTML below,
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
<head>
<title>Test®</title>
</head>
<body style="background-color:#EDEDED;">
<P>
<font style="color:#003698; font-weight:bold;">Testing HTML encoding - ’ © with a link
</font>
<br />
</P>
</body>
</html>
The output I get is,
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
<head>&NewLine;
<title>Test®</title>&NewLine;
</head>&NewLine;
<body style="background-color&colon;&num;EDEDED&semi;">&NewLine;
<p>&NewLine; <font style="color&colon;&num;003698&semi; font-weight&colon;bold&semi;">Testing HTML encoding - &rsquor; © with a <a href="http&colon;&sol;&sol;www&period;google&period;com">link</a></font> <br />&NewLine;</p>&NewLine;&NewLine;&NewLine;&NewLine;
</body>
</html>
Is there any way to get around this issue?

What output encoding character set are you using? (It will default to the input charset, which, if you are loading from URLs, will vary according to the site.)
You probably want to set it explicitly to UTF-8, or to ASCII or some other restricted charset if you are working with systems that cannot deal with UTF-8. If you set the escape mode to base (the default) and the charset to ASCII, then any character (like rsquo) that cannot be represented natively in the selected charset will be output as a numeric escape.
For example:
String check = "<p>’ <a href='../'>Check</a></p>";
Document doc = Jsoup.parse(check);
doc.outputSettings().escapeMode(Entities.EscapeMode.base); // default
doc.outputSettings().charset("UTF-8");
System.out.println("UTF-8: " + doc.body().html());
doc.outputSettings().charset("ASCII");
System.out.println("ASCII: " + doc.body().html());
Gives:
UTF-8: <p>’ <a href="../">Check</a></p>
ASCII: <p>&#8217; <a href="../">Check</a></p>
Hope this helps!

Related

Parse HTML with BeautifulSoup replaces existing HTML tag

I am using BeautifulSoup v4 to parse out a string of HTML that looks like this:
<!DOCTYPE HTML>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office">
<head></head>
<body><p>Hello, world</p></body>
</html>
Here is how I am parsing it:
soup = BeautifulSoup(html)
Where html is the pasted HTML above. For whatever reason, BS keeps replacing the <html> tag with a standard one, dropping the extra meta info. Is there any way I can tell BS not to do this?
I was able to figure it out by passing html5lib in as the HTML parser to BS. But now it keeps dropping in a stray HTML comment tag for the DOCTYPE:
<!--<!DOCTYPE HTML-->
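Explicitly selecting the parser is what controls this behavior, so pinning it is worth trying. A minimal sketch using the stdlib html.parser, which is the least aggressive about rewriting markup (exact output depends on which parsers are installed):

from bs4 import BeautifulSoup

html = """<!DOCTYPE HTML>
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office">
<head></head>
<body><p>Hello, world</p></body>
</html>"""

# html.parser treats the xmlns declarations as ordinary attributes and keeps them
soup = BeautifulSoup(html, "html.parser")
print(soup.html.attrs)  # should include xmlns, xmlns:v, xmlns:o
print(soup)             # DOCTYPE comes back as a doctype, not a comment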

Python webbrowser not opening browser

I'm trying to familiarize myself with Python, with a major interest in web publishing, so I looked around and found the following example. Running 2.7 in PyScripter on Windows 7 didn't bring up the browser as I expected; the code came up in Notepad++ instead, apparently because the .html suffix was associated with Notepad++. I tried about a dozen different permutations of the code, but the HTML file still opened in Notepad++ until I associated that file type with Firefox. When I include the print webbrowser._browsers command, I get:
{'windows-default': [<class 'webbrowser.WindowsDefault'>, None], 'c:\program files (x86)\internet explorer\iexplore.exe': [None, <webbrowser.BackgroundBrowser object>]}
This would imply to me that IE should be the default browser being called, but obviously it is not. Can anyone enlighten me here, as I am a Python newbie?
'''A simple program to create an html file from a given string,
and call the default web browser to display the file.'''
contents = '''<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="content-type">
<title>Hello</title>
</head>
<body>
Hello, World!
</body>
</html>
'''
import webbrowser

def main():
    browseLocal(contents)

def strToFile(text, filename):
    """Write a file with the given name and the given text."""
    output = open(filename, "w")
    output.write(text)
    output.close()

def browseLocal(webpageText, filename='C:\\Python27\\Programs\\tempBrowseLocal.html'):
    '''Start your webbrowser on a local file containing the text
    with given filename.'''
    strToFile(webpageText, filename)
    print webbrowser._browsers
    webbrowser.open(filename)

main()
def browseLocal(webpageText):  # take filename out of here
    '''Start your webbrowser on a local file containing the text
    with given filename.'''
    # define filename here
    filename = 'C:\\Python27\\Programs\\tempBrowseLocal.html'
    strToFile(webpageText, filename)
    print webbrowser._browsers
    webbrowser.open(filename)
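If the .html file association is what's hijacking the call, two things worth trying; a sketch only, assuming the paths and the _browsers entries shown above:

import os
import webbrowser

filename = 'C:\\Python27\\Programs\\tempBrowseLocal.html'
# A file:// URL sidesteps the .html file association in some setups;
# forward slashes are fine on Windows
url = 'file:///' + os.path.abspath(filename).replace('\\', '/')
webbrowser.open(url)

# Or pick a registered browser explicitly; this name comes straight from
# the _browsers dict printed above
# ie = webbrowser.get('c:\\program files (x86)\\internet explorer\\iexplore.exe')
# ie.open(url)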

Trouble with urllib calls in Python. Getting server error

I am trying to download an XML file from the Eurostat website, but I am having trouble using urllib in Python to do it. When I make the same request in my regular Chrome browser, the website generates an XML file, but when I try to do the same thing in Python I get a server error. This is the code I am using:
import urllib
from xml.etree import ElementTree as ET
response = urllib.urlopen("http://ec.europa.eu/eurostat/SDMX/diss-web/rest/data/lfsq_egais/Q.T.Y_GE15.EMP..NL")
result = response.read()
print result
I have tried using urllib.urlretrieve too and that didn't work either. Any reason why this might be happening? The HTML I get back is as follows:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Draft//EN">
<HTML>
<HEAD>
<TITLE>Error 500--Internal Server Error</TITLE>
<META NAME="GENERATOR" CONTENT="WebLogic Server">
</HEAD>
<BODY bgcolor="white">
<FONT FACE=Helvetica><BR CLEAR=all>
<TABLE border=0 cellspacing=5><TR><TD><BR CLEAR=all>
<FONT FACE="Helvetica" COLOR="black" SIZE="3"><H2>Error 500--Internal Server Error</H2>
</FONT></TD></TR>
</TABLE>
<TABLE border=0 width=100% cellpadding=10><TR><TD VALIGN=top WIDTH=100% BGCOLOR=white><FONT FACE="Courier New"><FONT FACE="Helvetica" SIZE="3"><H3>From RFC 2068 <i>Hypertext Transfer Protocol -- HTTP/1.1</i>:</H3>
</FONT><FONT FACE="Helvetica" SIZE="3"><H4>10.5.1 500 Internal Server Error</H4>
</FONT><P><FONT FACE="Courier New">The server encountered an unexpected condition which prevented it from fulfilling the request.</FONT></P>
</FONT></TD></TR>
</TABLE>
</BODY>
</HTML>
This question is a few months old now, but better late than never:
The Eurostat REST API you are talking to responds with XML content, but urllib does not announce that it accepts XML by default. The solution is to add the header Accept: application/xml to the request.
This will do the trick in Python 2.7 (using urllib2 by the way):
import urllib2

req = urllib2.Request("http://ec.europa.eu/eurostat/SDMX/diss-web/rest/data/"
                      "lfsq_egais/Q.T.Y_GE15.EMP..NL")
req.add_header("Accept", "application/xml")
response = urllib2.urlopen(req)
print response.read()
See urllib2 docs for more info and examples.
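For readers on Python 3, the same request would look like this with urllib.request; a straightforward port of the snippet above:

from urllib.request import Request, urlopen

req = Request("http://ec.europa.eu/eurostat/SDMX/diss-web/rest/data/"
              "lfsq_egais/Q.T.Y_GE15.EMP..NL",
              headers={"Accept": "application/xml"})
with urlopen(req) as response:
    print(response.read().decode("utf-8"))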

valid xhtml meta tags for facebook open-graph and twitter cards

To make the XHTML, Twitter Cards, and Facebook Open Graph markup all come up valid for http://www.theyact.com/acting-classes/los-angeles/, I've managed to get my code to validate everywhere, save one error on http://validator.w3.org/:
there is no attribute "property"
Among the many meta elements in the code, only the one below seems to ruffle the validator's feathers:
<meta name="og:description" property="og:description" content="...
I'd like the code to be completely valid in validator.w3.org's eyes. What am I missing?
If you remove this element, the validator will complain about the next one containing the property attribute.
The property attribute is part of RDFa, but your DOCTYPE doesn’t allow the use of RDFa.
If you want to keep using XHTML 1.1, you could change it to:
for RDFa 1.0: <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.0//EN" "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-1.dtd">
for RDFa 1.1: <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML+RDFa 1.1//EN" "http://www.w3.org/MarkUp/DTD/xhtml-rdfa-2.dtd">
Or simply switch to (X)HTML5, which comes with RDFa 1.1 support.

Can I use an <img> tag to send cookies across domains?

Look at this situation:
www.websitea.com displays an img tag with a src attribute of www.websiteb.com/image.aspx?id=5 and style="display:none"
www.websiteb.com returns a clear image, in addition to a cookie with a name of referrer and a value of 5 (created server-side from the validated querystring).
Would the cookie be created on domain www.websitea.com or www.websiteb.com?
Currently I use a series of redirects with querystrings to achieve cross-domain cookies, but I came up with this image idea a little while ago. I guess I could also use an iframe.
Thanks!
Check out:
cross-domain-user-tracking
Someone mentions using a 1x1 image for tracking across domains.
The cookie would be created for websiteb.com.
The cookie is created from the request to websiteb.com, so yeah... the cookie goes into websiteb's scope.
You're on the right track. As others have mentioned, the cookie would be created for websiteb.com.
To overcome issues with IE you'll probably need to add a Compact Privacy Policy.
Start here: http://msdn.microsoft.com/en-us/library/ms537342.aspx and Google for the rest.
Ok looks good. Tested in all browsers. Added a P3P tag for IE6, not sure if it was necessary though.
<%@ Page Language="VB" %>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<script runat="server">
    Protected Sub Page_Load(ByVal sender As Object, ByVal e As System.EventArgs)
        ' Compact P3P policy header so IE accepts the third-party cookie
        Response.AddHeader("P3P", "CP=""CAO PSA OUR""")
        Dim passedlocalizeID As String = Request.QueryString("id")
        Dim localizeID As Integer
        If passedlocalizeID IsNot Nothing AndAlso Int32.TryParse(passedlocalizeID, localizeID) Then
            Dim localizer As New Localizer
            localizer.LocalizeTo(localizeID)
        End If
    End Sub
</script>
<html xmlns="http://www.w3.org/1999/xhtml">
<head runat="server">
<title>Redirecting . . .</title>
<meta http-equiv="refresh" content="0;URL=/" />
</head>
<body>
<form id="form1" runat="server">
<div>
</div>
</form>
</body>
</html>