Pisa arabic rendering issue using xhtml2pdf and reportlab - django

Arabic text is displayed incorrectly in the output PDF generated by the following code:
template = get_template(template_src)
context = Context(context_dict)
html = template.render(context)
result = StringIO.StringIO()
pdf = pisa.pisaDocument(
    StringIO.StringIO(html.encode(pdf_encoding)),
    result, encoding=pdf_encoding)  # pdf_encoding = 'utf-8'
and in the template I set charset and fonts:
<meta http-equiv="content-type" content="text/html; charset=utf-8">
@font-face {
    font-family: AmiriRegular;
    src: url(/usr/share/fonts/opentype/fonts-hosny-amiri/amiri-regular.ttf);
}
body { font-family: AmiriRegular; }
In the output, the Arabic text is rendered incorrectly. It is supposed to be:
ياسر حسن

I think one solution is to pass the Arabic text manually and reshape it using these two libraries:
import arabic_reshaper
from bidi.algorithm import get_display
reshaped_text = arabic_reshaper.reshape('your desired text in arabic')
bidi_text = get_display(reshaped_text)
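One way to apply this to the view in the question, as a rough sketch only: run every top-level string in `context_dict` through the reshaper before the template is rendered. The helper names `reshape_context` and `arabic_shape` are made up for illustration, the code is written for Python 3 (on Python 2.7 the type check would be against `unicode`), and `arabic_shape` assumes the `arabic_reshaper` and `python-bidi` packages are installed.

```python
def arabic_shape(text):
    """Reshape and reorder Arabic text so it displays correctly left to right."""
    # Third-party imports are kept local so the generic helper below has no
    # hard dependency on them (assumption: both packages are installed).
    import arabic_reshaper
    from bidi.algorithm import get_display
    return get_display(arabic_reshaper.reshape(text))


def reshape_context(context_dict, shape):
    """Return a copy of context_dict with `shape` applied to each string value.

    Only top-level string values are touched; nested lists, dicts and
    model instances are passed through unchanged.
    """
    return {
        key: shape(value) if isinstance(value, str) else value
        for key, value in context_dict.items()
    }
```

With this in place, `Context(reshape_context(context_dict, arabic_shape))` would hand the reshaped strings to the template. Text that is hard-coded in the template itself is not covered by this sketch.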

How to remove strings which do not belong to an HTML tag in an HTML file

I have an HTML file which contains:
<html>
<head></head>
<body><p>thanks god its Friday</p></body>
</html>a& ca-79069608498"
<div class="cont" id="aka"></div>
<footer>
<div class="tent"><div class="cont"></div>
<h2><img alt="dscdsc" height="18" src="dsc.png" srcset="" width="116"/></h2>
</div>
</footer>
ipt> (window.NORLQ=window.NORLQ||[]).push(function(){var
ns,i,p,img;ns=document.getElementsByTagName('noscript');for(i=0;i<ns.len)>-1){img=document.createEleight'));img.setAttribute('alt',p.getAttribute('data-alt'));p.parentNode.replaceChild(img,p);}}});/*]]>*/</script><script>(window.RLQ=window.RLQ||[]).push(function(
The file is named a.html.
I want to remove everything after </html> in the file using Python 2.7. The strings after the closing tag do not all belong to a tag, and some of them are just noise, so I could not do it with BeautifulSoup, and I don't know whether it is wise to use a regex on an HTML file.
How can I remove the strings after </html> and write the result back to the HTML file?
With a regex:
import re
...
newhtml = re.sub(r'</html>[\s\S]*', '</html>', oldhtml)
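# Alternative without a regex: keep only the part up to the first </html>
a = open(path, "r").read()
b = a.split('</html>', 1)[0] + '</html>'  # split() drops the separator, so add it back
open(path, 'w').write(b)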
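The two short answers above can be folded into one small helper. This is a sketch under a few assumptions: the function names `strip_after_html` and `clean_file` are invented here, the code is written for Python 3, and the closing tag is matched case-insensitively with optional whitespace before `>`, which neither answer handles.

```python
import re

# Matches the closing tag in any letter case, allowing whitespace before ">".
CLOSING_HTML = re.compile(r"</html\s*>", re.IGNORECASE)


def strip_after_html(text):
    """Return text up to and including the first </html>.

    If no closing tag is found, the text is returned unchanged.
    """
    match = CLOSING_HTML.search(text)
    if match is None:
        return text
    return text[:match.end()]


def clean_file(path):
    """Rewrite the file at `path` without anything after </html>."""
    with open(path, "r") as handle:
        cleaned = strip_after_html(handle.read())
    with open(path, "w") as handle:
        handle.write(cleaned)
```

Calling `clean_file("a.html")` would rewrite the file in place, so it is worth trying on a copy first.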
Python has a module called HTMLParser for dealing with this sort of problem.
While the proposed regex seems to handle your problem well for now, it can be hard to debug when it fails on edge cases in the HTML.
Therefore I am proposing an HTMLParser solution, which gives you more control over the parsing behaviour.
Example:
from HTMLParser import HTMLParser
class MyHTMLParser(HTMLParser):
    buffer = ""
    end_of_html = False
    def get_html(self):
        return self.buffer
    def handle_starttag(self, tag, attrs):
        if not self.end_of_html:
            value = "<" + tag
            for attr in attrs:
                # Separate attributes with a space and quote their values
                value += ' ' + attr[0] + '="' + (attr[1] or "") + '"'
            self.buffer += value + ">"
    def handle_data(self, data):
        if not self.end_of_html:
            self.buffer += data
    def handle_endtag(self, tag):
        if not self.end_of_html:
            self.buffer += "</" + tag + ">"
            # Stop collecting once the closing html tag has been written
            if tag == "html":
                self.end_of_html = True
parser = MyHTMLParser()
parser.feed("""<html>
</div>
<head></head>
<body><p>thanks god its Friday</p></body>
</html>a& ca-79069608498"
<div class="cont" id="aka"></div>
<footer>
<div class="tent"><div class="cont"></div>
<h2><img alt="dscdsc" height="18" src="dsc.png" srcset="" width="116"/></h2>
</div>
</footer>
ipt> (window.NORLQ=window.NORLQ||[]).push(function(){var
ns,i,p,img;ns=document.getElementsByTagName('noscript');for(i=0;i<ns.len)>-1){img=document.createEleight'));img.setAttribute('alt',p.getAttribute('data-alt'));p.parentNode.replaceChild(img,p);}}});/*]]>*/</script><script>(window.RLQ=window.RLQ||[]).push(function(
""")
print parser.get_html()
Output:
<html>
</div>
<head></head>
<body><p>thanks god its Friday</p></body>
</html>
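If the markup before `</html>` should be kept exactly as written rather than rebuilt from parser events, one variant is to use the parser only to locate the closing tag and then slice the original string. This is a sketch for Python 3, where the module is `html.parser`; the names `HtmlEndLocator` and `truncate_after_html` are made up here, and the offset arithmetic assumes the whole document is passed in a single `feed()` call.

```python
from html.parser import HTMLParser


class HtmlEndLocator(HTMLParser):
    """Remember where the first </html> end tag starts."""

    def __init__(self):
        super().__init__()
        self.end_pos = None

    def handle_endtag(self, tag):
        # Tag names arrive lower-cased; only the first </html> is recorded.
        if tag == "html" and self.end_pos is None:
            self.end_pos = self.getpos()  # (1-based line, 0-based column)


def truncate_after_html(text):
    """Return text up to and including the first </html> end tag."""
    locator = HtmlEndLocator()
    locator.feed(text)
    if locator.end_pos is None:
        return text
    line_no, column = locator.end_pos
    lines = text.split("\n")
    # Turn (line, column) into an absolute index into the original string.
    start = sum(len(line) + 1 for line in lines[:line_no - 1]) + column
    end = text.index(">", start) + 1
    return text[:end]
```

Because the result is a slice of the input, attributes, entities and whitespace before the closing tag are left untouched.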

finding href value in python

I'm working on a project that searches webpage content for some data:
from lxml import html
import requests
def tabletPhone(webAddress):
    page = requests.get(webAddress)
    tree = html.fromstring(page.content)
    product = tree.xpath('//h1[@class="product_title entry-title"]/text()')
    price = tree.xpath("//span[@class='price-number']/text()")
    availability = tree.xpath("//n:link", namespaces={'n': 'availability'})
    return product, price, availability
I have a problem finding the availability of the product. The HTML is something like:
<link itemprop="availability" href="http://schema.org/InStock" />
Is there any way to return {'availability': 'http://schema.org/InStock'} or just 'http://schema.org/InStock'?
We can achieve this using BeautifulSoup:
from bs4 import BeautifulSoup
samplecode = '''Google <link itemprop="availability" href="http://schema.org/InStock"/> <link itemprop="availability" href="http://google.com"/>'''
soup = BeautifulSoup(samplecode,"lxml")
for line in soup.find_all(href=True):
    print "Url-", line['href']
The snippet above searches every tag for an href attribute. To restrict the search to a specific tag, use:
for line in soup.find_all('link', href=True):
    print "Url-", line['href']
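Both loops above print every href, including the second link that has nothing to do with availability. To return only the value the question asks for, the match can also check the `itemprop` attribute. The sketch below does this with the standard library's `html.parser` instead of BeautifulSoup, so it is a swapped-in technique rather than the answer's method; `AvailabilityFinder` and `find_availability` are invented names, and the code targets Python 3. With lxml, as in the question, an XPath along the lines of `//link[@itemprop="availability"]/@href` should select the same attribute.

```python
from html.parser import HTMLParser


class AvailabilityFinder(HTMLParser):
    """Collect href values of <link itemprop="availability"> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <link ... />.
        attributes = dict(attrs)
        if tag == "link" and attributes.get("itemprop") == "availability":
            href = attributes.get("href")
            if href is not None:
                self.hrefs.append(href)


def find_availability(markup):
    """Return a list of availability URLs found in the markup."""
    finder = AvailabilityFinder()
    finder.feed(markup)
    return finder.hrefs
```

Wrapping the first element as `{'availability': hrefs[0]}` gives the dictionary form mentioned in the question.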

Python webbrowser not opening browser

I'm trying to familiarize myself with Python with a major concern for web publishing, so I looked around and found the following example. Running 2.7 in PyScripter on Windows 7 didn't bring up the browser as I expected. The code came up in Notepad++ instead, apparently since the html suffix was associated with Notepad. I tried about a dozen different permutations of the code, but the html file still opened in Notepad until I associated that file with Firefox. When I include the print webbrowser._browsers command, I get {'windows-default': [, None], 'c:\program files (x86)\internet explorer\iexplore.exe': [None, ]}
This would imply to me that IE should be the default browser being called, but obviously it is not. Can anyone enlighten me here as I am a Python newbie?
'''A simple program to create an html file from a given string,
and call the default web browser to display the file.'''
contents = '''<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="content-type">
<title>Hello</title>
</head>
<body>
Hello, World!
</body>
</html>
'''
import webbrowser
def main():
    browseLocal(contents)
def strToFile(text, filename):
    """Write a file with the given name and the given text."""
    output = open(filename,"w")
    output.write(text)
    output.close()
def browseLocal(webpageText, filename='C:\\Python27\\Programs\\tempBrowseLocal.html'):
    '''Start your webbrowser on a local file containing the text
    with given filename.'''
    strToFile(webpageText, filename)
    print webbrowser._browsers
    webbrowser.open(filename)
main()
def browseLocal(webpageText):#take filename out of here
    '''Start your webbrowser on a local file containing the text
    with given filename.'''
    #define filename here.
    filename='C:\\Python27\\Programs\\tempBrowseLocal.html'
    strToFile(webpageText, filename)
    print webbrowser._browsers
    webbrowser.open(filename)
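The behaviour described in the question matches a note in the Python documentation for `webbrowser`: passing a bare filename to `webbrowser.open` may start the operating system's associated program, which is neither supported nor portable. A possible workaround, sketched here for Python 3 with `pathlib`, is to pass a `file://` URL instead of a path. The helper names `to_file_url` and `browse_local_file` are made up, and whether the browser or the associated program ends up handling the URL still depends on the platform and its configuration.

```python
import webbrowser
from pathlib import Path


def to_file_url(filename):
    """Convert a local path into a file:// URL."""
    # as_uri() requires an absolute path, so make it absolute first.
    return Path(filename).absolute().as_uri()


def browse_local_file(filename):
    """Ask the default browser to open a local file through its file:// URL."""
    return webbrowser.open(to_file_url(filename))
```

In the code above, `browse_local_file(filename)` would replace the `webbrowser.open(filename)` call.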

Django - Pisa generated pdf doesn't have spaces

I'm using Django and my code to render the PDF is really typical:
t = loader.get_template('back/templates/content/receipt.html')
c = RequestContext(request, {
    'pagesize': 'A4',
    'invoice': invoice,
    'plan': plan,
})
html = t.render(c)
result = StringIO.StringIO()
pdf = pisa.pisaDocument(StringIO.StringIO(html.encode("UTF-8")), result)
if not pdf.err:
    return HttpResponse(result.getvalue(), mimetype="application/pdf")
And the receipt.html is nothing unusual:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
<title>Squizzal Receipt</title>
<style type="text/css">
@page {
    size: {{pagesize}};
    margin: 1cm;
    word-spacing 1cm;
    @frame footer {
        -pdf-frame-content: footerContent;
        bottom: 0cm;
        margin-left: 9cm;
        margin-right: 9cm;
        height: 1cm;
    }
}
</style>
</head>
<body>
<h1>Your Receipt</h1>
<<SNIP>>
But none of the spaces in the PDF are rendered: all the words are right next to each other. I've tried normal spaces and "&nbsp" and the result is the same. For example, the heading above appears as "YourReceipt" in the PDF.
When I try using the command line version of pisa, it generates the pdf just fine with spaces between the words.
Any thoughts?
I had this same issue and didn't want to force the PDF to be downloaded from the browser. This turned out to be a platform specific issue: Google Chrome's native PDF viewer plugin fails to render spaces in certain documents on certain Linux distros when Microsoft TrueType fonts are not installed. See http://www.google.com/support/forum/p/Chrome/thread?tid=7169b114e8ea33c7&hl=en for details.
I fixed this by simply running the following commands in bash (adjust for your distro; this was on Ubuntu):
$ sudo apt-get install msttcorefonts
(Accept the EULA during the install process)
$ fc-cache -fv
After restarting Chrome (important!), the native PDF viewer correctly displayed the PDF with spaces.
Ok thanks to akonsu the problem seems to be how Django's HttpResponse is being treated (either on the server side or on the browser side).
Instead of
return HttpResponse(result.getvalue(), mimetype="application/pdf")
Use:
resp = HttpResponse(result.getvalue(), mimetype="application/pdf")
resp['Content-Disposition'] = 'attachment; filename=receipt.pdf'
return resp
This at least produces a result with spaces. Still no idea why the first way wasn't working.

Problems using extended escape mode for jsoup output

I need to transform a HTML file, by removing certain tags from the file. To do this I have something like this -
import org.jsoup.Jsoup;
import org.jsoup.helper.Validate;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Entities;
import org.jsoup.nodes.Entities.EscapeMode;
import java.io.IOException;
import java.io.File;
import java.util.*;
public class TestJsoup {
    public static void main(String[] args) throws IOException {
        Validate.isTrue(args.length == 1, "usage: supply url to fetch");
        String url = args[0];
        Document doc = null;
        if (url.contains("http")) {
            doc = Jsoup.connect(url).get();
        } else {
            File f = new File(url);
            doc = Jsoup.parse(f, null);
        }
        /* remove some tags */
        doc.outputSettings().escapeMode(Entities.EscapeMode.extended);
        System.out.println(doc.html());
        return;
    }
}
The issue with the above code is that, when I use the extended escape mode, the HTML tag attributes in the output are HTML-encoded. Is there any way to avoid this? Using base or xhtml as the escape mode doesn't work, because some of the non-standard extended entities (like ’) cause problems. For example, for the HTML below,
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
<head>
<title>Test®</title>
</head>
<body style="background-color:#EDEDED;">
<P>
<font style="color:#003698; font-weight:bold;">Testing HTML encoding - ’ © with a <a href="http://www.google.com">link</a>
</font>
<br />
</P>
</body>
</html>
The output I get is,
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html>
<head>&NewLine;
<title>Test®</title>&NewLine;
</head>&NewLine;
<body style="background-color&colon;&num;EDEDED&semi;">&NewLine;
<p>&NewLine; <font style="color&colon;&num;003698&semi; font-weight&colon;bold&semi;">Testing HTML encoding - &rsquor; © with a <a href="http&colon;&sol;&sol;www&period;google&period;com">link</a></font> <br />&NewLine;</p>&NewLine;&NewLine;&NewLine;&NewLine;
</body>
</html>
Is there any way to get around this issue?
What output encoding character set are you using? (It will default to the input, which if you are loading from URLs, will vary according to the site).
You probably want to explicitly set it to either UTF-8, or to ASCII or some other low setting if you are working with systems that cannot deal with UTF-8. If you set the escape mode to base (the default) and the charset to ASCII, then any character (like rsquo) that cannot be represented natively in the selected charset will be output as a numerical escape.
For example:
String check = "<p>’ <a href='../'>Check</a></p>";
Document doc = Jsoup.parse(check);
doc.outputSettings().escapeMode(Entities.EscapeMode.base); // default
doc.outputSettings().charset("UTF-8");
System.out.println("UTF-8: " + doc.body().html());
doc.outputSettings().charset("ASCII");
System.out.println("ASCII: " + doc.body().html());
Gives:
UTF-8: <p>’ <a href="../">Check</a></p>
ASCII: <p>’ <a href="../">Check</a></p>
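The fallback to a numerical escape is not specific to jsoup. As a loose analogy only (this is Python's codec machinery, not jsoup's API), the sketch below encodes the same right single quotation mark, U+2019, once as UTF-8 and once as ASCII with the `xmlcharrefreplace` error handler. The variable names are made up for illustration.

```python
text = "\u2019 Check"

# UTF-8 can represent the character directly, so it round-trips unchanged.
utf8_bytes = text.encode("utf-8")

# ASCII cannot, so the character is replaced by a numeric character reference.
ascii_bytes = text.encode("ascii", "xmlcharrefreplace")

print(utf8_bytes.decode("utf-8"))
print(ascii_bytes.decode("ascii"))  # &#8217; Check
```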
Hope this helps!