flask's send_file truncating burmese characters - flask

I created a PDF file from a template, edited it with pdfrw, and want to return the output PDF to the user. The users of my form will be from Myanmar (using Burmese characters). The output PDF file itself displays the Burmese characters correctly, but it seems that Flask's send_file is truncating or not displaying the fields with Burmese values. I tried entering Japanese, Korean, and Chinese characters, and so far these display correctly. I also tried Hindi (Indian characters) and got the same truncated result as with Burmese.
pdf_file = create_pdf(form, 'claim-form')
# mimetype takes a bare MIME type, not a full "Content-Type: ..." header line
return send_file(os.path.realpath(pdf_file.name), download_name='python.pdf',
                 as_attachment=True, mimetype='application/pdf')

Related

Special character encoding added - PDF Django

I have a function that creates a simple PDF, but special characters do not render correctly. How do I correctly save characters such as śćźż in my PDF file?
I tried to change the font type using setFont (Helvetica, TimesRoman) according to this doc, but I was not able to get the expected results.
views.py (from the official docs):
import io
from django.http import FileResponse
from reportlab.pdfgen import canvas

def some_view_aa(request):
    # Create a file-like buffer to receive PDF data.
    buffer = io.BytesIO()
    # Create the PDF object, using the buffer as its "file."
    p = canvas.Canvas(buffer)
    # Draw things on the PDF. Here's where the PDF generation happens.
    # See the ReportLab documentation for the full list of functionality.
    p.drawString(100, 100, "Hello AZX AĄĄŻĄ world.")
    # Close the PDF object cleanly, and we're done.
    p.showPage()
    p.save()
    # FileResponse sets the Content-Disposition header so that browsers
    # present the option to save the file.
    buffer.seek(0)
    return FileResponse(buffer, as_attachment=True, filename='hello.pdf')
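A quick way to check whether the characters themselves are the issue is to test them against a single-byte Western encoding. This is only a diagnostic sketch: it assumes that the built-in Helvetica and Times-Roman fonts cover roughly a Windows-1252-like character set, which the question does not confirm. The helper name chars_outside_encoding is hypothetical, and the code uses only the standard library, not ReportLab.

```python
def chars_outside_encoding(text, encoding="cp1252"):
    """Return the distinct characters of `text` that `encoding` cannot represent."""
    missing = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            # Keep first-seen order and skip repeats.
            if ch not in missing:
                missing.append(ch)
    return missing


# The string drawn in the view above.
sample = "Hello AZX AĄĄŻĄ world."
print(chars_outside_encoding(sample))
```

If the list is not empty, those characters would probably need a font that actually contains them (for example a TrueType font registered with the PDF library) rather than a different built-in font.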

How to have my path and query parameters from my Post webservices be in UTF-8 encoding and display Chinese character correctly in the backend in Java

I have this POST web service that takes a query parameter, which we read into a string called newStationName:
@POST
@Path("/station/{stationid}/")
public void setStationOptions (@PathParam("stationid") Integer stationID,
        @QueryParam("stationname") String newStationName,
)
{
The problem with that query parameter is that if someone passes a station name in Chinese, it shows up on the framework end decoded as Latin-1 (ISO-8859-1) and looks like a bunch of garbled text. The way I've gotten it to display correctly is by getting the string's bytes and converting them from Latin-1 to UTF-8 like this:
try {
    decodedNewStationName = new String(newStationName.getBytes("ISO8859-1"), "utf-8");
}
catch (UnsupportedEncodingException e) {
    log.error("Can't decode newStationName");
    e.printStackTrace();
}
I would like to find a global solution for this so that every time we receive a user inputted Chinese string from our web app on any webservice, we don't need to put this try catch block there as well.
I've tried playing with our Tomcat and Jersey server filters and encoding settings, and that hasn't worked. I've also tried setting the request and response encoding to UTF-8, and that hasn't worked. I've also tried encoding the URL parameter in UTF-8, but that just sends the string back percent-encoded, like this: "%C%A%D etc...", and then that needs to be decoded.
I haven't been able to find anything that has worked globally to this point but I feel that there has to be something I'm missing.
I have also edited the Connectors in the server.xml file to have their URI Encoding to UTF-8 as well as the server.xml file encoding itself in utf-8.
<Connector URIEncoding="UTF-8" port="8080"
protocol="HTTP/1.1"
connectionTimeout="20000"
redirectPort="8443" />
I have also changed the encoding of the web.xml file to UTF-8, and the character encoding for the HttpServletRequest and HttpServletResponse is set to UTF-8.
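To make the symptom concrete, here is a small Python sketch (not Java, and not the poster's code) of the two effects described above: UTF-8 bytes decoded as ISO-8859-1 produce garbled text that can be repaired by reversing the step, and a percent-encoded parameter has to be decoded before it is readable. The sample station name and the variable names are assumptions made for illustration.

```python
from urllib.parse import quote, unquote

original = "中文站"  # hypothetical station name

# What the backend ends up with when the UTF-8 bytes are decoded as ISO-8859-1.
garbled = original.encode("utf-8").decode("iso-8859-1")

# The workaround from the question, in Python terms: recover the raw bytes
# via ISO-8859-1 and decode them as UTF-8.
repaired = garbled.encode("iso-8859-1").decode("utf-8")

# Percent-encoding the parameter as UTF-8, then decoding it again.
percent_encoded = quote(original)
decoded = unquote(percent_encoded)

print(repr(garbled))
print(repaired)
print(percent_encoded)
print(decoded)
```

The round trip only works because ISO-8859-1 maps every byte value to a character, so no information is lost in the wrong decoding step.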

How can I scrape a page that literally contains "\x2d", but save that character as "-" in my item?

I need to scrape some text from within a script on a page and save that text in a Scrapy item, presumably as a UTF-8 string. However, the literal text I'm scraping has special characters written out as hex escapes, e.g. "-" is written as "\x2d". How can I scrape characters represented as "\x2d" but save them as "-" in my Scrapy item?
Excerpt of contents on scraped page:
<script type="text/javascript">
[approx 100 various lines of script, omitted]
"author": "Kurt\x20Vonnegut",
"internetPrice": "799",
"inventoryType": "new",
"title": "Slaughterhouse\x2DFive",
"publishedYear": "1999",
[approx 50 additional various lines of script, removed]
</script>
My scrapy script goes like this:
pattern_title = r'"title": "(.+)"'
title_raw = response.xpath('//script[@type="text/javascript"]').re(pattern_title)
item['title'] = title_raw[0]
For this item, scrapy's output will return:
'author': u'Kurt\x20Vonnegut', 'title': u'Slaughterhouse\x2DFive'
Ideally, I would like:
'author': 'Kurt Vonnegut', 'title': 'Slaughterhouse-Five'
Things I've tried with no change to the output:
Change last line to: item['title'] = title_raw[0].decode('utf-8')
Change last line to: item['title'] = title_raw[0].encode('latin1').decode('utf-8')
Finally, in case it needs to be explicitly stated, I do not have control over how this information is being displayed on the site I'm scraping.
Inspired by Converting \x escaped string to UTF-8, I solved this by using .decode('string-escape') (a codec available in Python 2 only), as follows:
pattern_title = r'"title": "(.+)"'
title_raw = response.xpath('//script[@type="text/javascript"]').re(pattern_title)
title_raw[0] = title_raw[0].decode('string-escape')
item['title'] = title_raw[0]
You can use urllib's unquote function. It decodes percent-escapes such as %20, so replace the literal \x prefix with % first.
On Python 3.x:
from urllib.parse import unquote
unquote(r"Kurt\x20Vonnegut".replace("\\x", "%"))
On Python 2.7:
from urllib import unquote
unquote(r"Kurt\x20Vonnegut".replace("\\x", "%"))
Take a look at Item Loaders and Input Processors so you can do this for all scraped fields.
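For Python 3, where the string-escape codec is not available, the same idea can be sketched with a regular expression. This assumes the escapes always have the form \x followed by exactly two hex digits, as in the excerpt above; the function name unescape_hex is hypothetical.

```python
import re

# Matches a literal backslash, an "x", and exactly two hex digits.
_HEX_ESCAPE = re.compile(r"\\x([0-9A-Fa-f]{2})")


def unescape_hex(text):
    """Replace literal \\xNN sequences with the character they stand for."""
    return _HEX_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


# Raw strings keep the backslash literal, like the text scraped from the page.
print(unescape_hex(r"Kurt\x20Vonnegut"))
print(unescape_hex(r"Slaughterhouse\x2DFive"))
```

A function like this could be used as an input processor in an Item Loader so that every scraped field goes through it. Text without escapes, including non-ASCII characters, passes through unchanged.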

scraping chinese characters python

I learnt how to scrape websites from https://automatetheboringstuff.com. I wanted to scrape http://www.piaotian.net/html/3/3028/1473227.html, whose content is in Chinese, and write its content to a .txt file. However, the .txt file contains random symbols, which I assume is an encoding/decoding problem.
I've read the thread "how to decode and encode web page with python?" and figured that the encodings for my site are "gb2312" and "windows-1252". I tried decoding with both encodings but failed.
Can someone kindly explain to me the problem with my code? I'm very new to programming so please let me know my misconceptions as well!
Also, when I remove the "html.parser" from the code, the .txt file turns out to be empty instead of having at least symbols. Why is this the case?
import bs4, requests, sys
reload(sys)
sys.setdefaultencoding("utf-8")
novel = requests.get("http://www.piaotian.net/html/3/3028/1473227.html")
novel.raise_for_status()
novelSoup = bs4.BeautifulSoup(novel.text, "html.parser")
content = novelSoup.select("br")
novelFile = open("novel.txt", "w")
for i in range(len(content)):
    novelFile.write(str(content[i].getText()))
Set the response encoding to GBK before reading novel.text:
novel = requests.get("http://www.piaotian.net/html/3/3028/1473227.html")
novel.raise_for_status()
novel.encoding = "GBK"
novelSoup = bs4.BeautifulSoup(novel.text, "html.parser")
Output:
<br>
一元宗,坐落在青峰山上,绵延极长,现在是盛夏时节,天空之中,太阳慢慢落了下去,夕阳将影子拉的很长。<br/>
<br/>
一片不是很大的小湖泊边上,一个约莫着十七八岁的青衣少年坐在湖边,抓起湖边的一块石头扔出,顿时在湖边打出几朵浪花。<br/>
<br/>
叶希文有些茫然,他没想到,他居然穿越了,原本叶希文只是二十一世纪的地球上一个普通的大学生罢了,一个月了,他才后知后觉的反应过来,这不是有人和他进行恶作剧,而是,他真的穿越了。<br/>
Requests will automatically decode content from the server. Most
unicode charsets are seamlessly decoded.
When you make a request, Requests makes educated guesses about the
encoding of the response based on the HTTP headers. The text encoding
guessed by Requests is used when you access r.text. You can find out
what encoding Requests is using, and change it, using the r.encoding
property:
>>> r.encoding
'utf-8'
>>> r.encoding = 'ISO-8859-1'
If you change the encoding, Requests will use the new value of
r.encoding whenever you call r.text.
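The effect of a wrong encoding guess can be reproduced without any network access. The sketch below uses only the standard library: it encodes a short Chinese sample as GBK (standing in for the bytes the server sends, which is an assumption about this site), then decodes the same bytes once with the wrong codec and once with the right one.

```python
sample = "一元宗,坐落在青峰山上"

# Stand-in for the raw response body as sent by the server.
raw_bytes = sample.encode("gbk")

# Decoding with the wrong codec succeeds but yields unrelated symbols,
# because ISO-8859-1 maps every single byte to some character.
wrong = raw_bytes.decode("iso-8859-1")

# Decoding with the codec the bytes were written in restores the text.
right = raw_bytes.decode("gbk")

print(repr(wrong))
print(right)
```

This is the same switch that setting the encoding on the response performs before the text is read.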

Django csv output in Excel

Hi, I have a simple view which returns a CSV file of a queryset generated from a MySQL database using UTF-8 encoding:
def export_csv(request):
    ...
    response = HttpResponse(mimetype='text/csv')
    response['Content-Disposition'] = 'attachment; filename=search_results.csv'
    writer = csv.writer(response, dialect=csv.excel)
    for item in query_set:
        writer.writerow(smart_str(item))
    return response
    return render(request, 'search_results.html', context)
This works fine as a CSV file, and can be opened in text editors, LibreOffice etc. without problem.
However, I need to supply a file which can be opened in MS Excel on Windows without errors. If I have strings with Latin characters in the queryset such as 'Española', then the output in Excel is garbled, e.g. 'EspaÃ±ola'.
I tried this blogpost but it didn't help. I also know about the xlwt package, but I am curious if there is a way of correcting the output using the CSV method I have at the moment.
Any help much appreciated.
Looks like there is no uniform solution for all versions of Excel.
Your best bet might be to go with openpyxl, but this is rather complicated and requires
separate handling of downloads for Excel users, which is not optimal.
Try adding a byte order mark (0xEF, 0xBB, 0xBF) at the beginning of the file. See microsoft-excel-mangles-diacritics-in-csv-files
There is another similar post.
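A minimal standard-library sketch of the byte order mark suggestion: Python's utf-8-sig codec prepends the bytes 0xEF, 0xBB, 0xBF when encoding. The function name rows_to_excel_csv_bytes and the sample rows are hypothetical, and whether a given Excel version then detects UTF-8 is not guaranteed.

```python
import csv
import io


def rows_to_excel_csv_bytes(rows):
    """Serialize rows as CSV and return UTF-8 bytes that start with a BOM."""
    buffer = io.StringIO(newline="")  # let the csv module control line endings
    writer = csv.writer(buffer, dialect=csv.excel)
    writer.writerows(rows)
    # "utf-8-sig" writes the byte order mark before the encoded text.
    return buffer.getvalue().encode("utf-8-sig")


payload = rows_to_excel_csv_bytes([["name"], ["Española"]])
print(payload[:3])
```

In a web view, bytes like these would be passed as the response body with a text/csv content type.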
You might give python-unicodecsv a go. It replaces the python csv module which doesn't handle Unicode too gracefully.
Put the unicodecsv folder somewhere you can import it, or install it via setup.py.
Import it into your view file, e.g.:
import unicodecsv as csv
I found out there are 3 things to do for Excel to open unicode csv files properly:
Use utf-16-le charset
Insert utf-16 byte order mark to the beginning of exported file
Use tabs instead of commas in csv
So, this should make it work in Python 3.7 and Django 2.2
import codecs
...
def export_csv(request):
    ...
    response = HttpResponse(content_type='text/csv', charset='utf-16-le')
    response['Content-Disposition'] = 'attachment; filename=search_results.csv'
    response.write(codecs.BOM_UTF16_LE)
    writer = csv.writer(response, dialect='excel-tab')
    for item in query_set:
        writer.writerow(smart_str(item))
    return response
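The three steps can also be checked outside Django. The sketch below builds the same kind of payload with the standard library only, so the resulting bytes can be inspected or written to a file for a manual test in Excel. The function name rows_to_utf16_tsv_bytes and the sample rows are hypothetical.

```python
import codecs
import csv
import io


def rows_to_utf16_tsv_bytes(rows):
    """Serialize rows as tab-separated text, UTF-16-LE encoded, with a BOM."""
    buffer = io.StringIO(newline="")  # let the csv module control line endings
    writer = csv.writer(buffer, dialect="excel-tab")  # tabs instead of commas
    writer.writerows(rows)
    # The "utf-16-le" codec does not add a BOM, so it is prepended explicitly.
    return codecs.BOM_UTF16_LE + buffer.getvalue().encode("utf-16-le")


payload = rows_to_utf16_tsv_bytes([["id", "name"], ["1", "Española"]])
print(payload[:2])
```

Decoding the payload with the generic utf-16 codec reads the byte order mark and returns the original tab-separated text, which is a quick way to confirm the bytes are well formed.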