Scraping Chinese characters with Python - python-2.7

I learnt how to scrape websites from https://automatetheboringstuff.com. I wanted to scrape http://www.piaotian.net/html/3/3028/1473227.html, whose contents are in Chinese, and write those contents into a .txt file. However, the .txt file contains random symbols, which I assume is an encoding/decoding problem.
I've read the thread "how to decode and encode web page with python?" and figured the encodings for my site are "gb2312" and "windows-1252". I tried decoding with those two encodings but failed.
Can someone kindly explain the problem with my code? I'm very new to programming, so please point out my misconceptions as well!
Also, when I remove "html.parser" from the code, the .txt file turns out empty instead of at least containing symbols. Why is this the case?
import bs4, requests, sys
reload(sys)
sys.setdefaultencoding("utf-8")
novel = requests.get("http://www.piaotian.net/html/3/3028/1473227.html")
novel.raise_for_status()
novelSoup = bs4.BeautifulSoup(novel.text, "html.parser")
content = novelSoup.select("br")
novelFile = open("novel.txt", "w")
for i in range(len(content)):
    novelFile.write(str(content[i].getText()))

novel = requests.get("http://www.piaotian.net/html/3/3028/1473227.html")
novel.raise_for_status()
novel.encoding = "GBK"
novelSoup = bs4.BeautifulSoup(novel.text, "html.parser")
Output:
<br>
一元宗,坐落在青峰山上,绵延极长,现在是盛夏时节,天空之中,太阳慢慢落了下去,夕阳将影子拉的很长。<br/>
<br/>
一片不是很大的小湖泊边上,一个约莫着十七八岁的青衣少年坐在湖边,抓起湖边的一块石头扔出,顿时在湖边打出几朵浪花。<br/>
<br/>
叶希文有些茫然,他没想到,他居然穿越了,原本叶希文只是二十一世纪的地球上一个普通的大学生罢了,一个月了,他才后知后觉的反应过来,这不是有人和他进行恶作剧,而是,他真的穿越了。<br/>
Requests will automatically decode content from the server. Most
unicode charsets are seamlessly decoded.
When you make a request, Requests makes educated guesses about the
encoding of the response based on the HTTP headers. The text encoding
guessed by Requests is used when you access r.text. You can find out
what encoding Requests is using, and change it, using the r.encoding
property:
>>> r.encoding
'utf-8'
>>> r.encoding = 'ISO-8859-1'
If you change the encoding, Requests will use the new value of
r.encoding whenever you call r.text.
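
Putting the answer together with the original script, here is a minimal sketch that writes the chapter text to a UTF-8 .txt file (assuming the page still serves GBK-encoded content; io.open handles the unicode-to-file encoding, so the reload(sys)/setdefaultencoding hack is unnecessary):

# -*- coding: utf-8 -*-
import io

import bs4
import requests

novel = requests.get("http://www.piaotian.net/html/3/3028/1473227.html")
novel.raise_for_status()
novel.encoding = "GBK"  # override requests' header-based guess with the page's real charset

novelSoup = bs4.BeautifulSoup(novel.text, "html.parser")
# Mirrors the question's select("br") approach; with some parsers you may
# need to select the enclosing content element instead.
with io.open("novel.txt", "w", encoding="utf-8") as novelFile:
    for tag in novelSoup.select("br"):
        novelFile.write(tag.getText() + u"\n")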

Related

Read text file content with Python at Zapier

I have problems getting the content of a txt-file into a Zapier
object using https://zapier.com/help/code-python/. Here is the code I am
using:
with open('file', 'r') as content_file:
    content = content_file.read()
I'd be glad if you could help me with this. Thanks for that!
David here, from the Zapier Platform team.
Your code as written doesn't work because the first argument of the open function is the filepath. There's no file at the path 'file', so you'll get an error. You access the input via the input_data dictionary.
That being said, the input is a URL, not a file. You need to use urllib to read that URL. I found the answer here.
I've got a working copy of the code like so:
import urllib2 # the lib that handles the url stuff
result = []
data = urllib2.urlopen(input_data['file'])
for line in data:  # file lines are iterable
    result.append(line)  # keep each line, or parse, etc.
return {'lines': result}
The key takeaway is that you need to return a dictionary from the function, so make sure you somehow squish your file into one.
Let me know if you've got any other questions!
@xavid, did you test this in Zapier?
It fails miserably because urllib2 doesn't exist in the Zapier Python environment.
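
If the Zapier code step runs Python 3, the same idea should port to urllib.request; a sketch, assuming input_data['file'] still holds the file's URL and the content is UTF-8 text:

from urllib.request import urlopen  # Python 3 home of the old urllib2 functionality

result = []
with urlopen(input_data['file']) as data:
    for line in data:  # response lines are bytes in Python 3
        result.append(line.decode('utf-8'))
return {'lines': result}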

pyPdf Splitting Large PDF fails after splitting 150-152 pages of the PDF

I have a function that takes a PDF file path as input and splits the PDF into separate pages, as shown below:
import os, time
from pyPdf import PdfFileReader, PdfFileWriter

def split_pages(file_path):
    print("Splitting the PDF")
    temp_path = os.path.join(os.path.abspath(__file__), "temp_" + str(int(time.time())))
    if not os.path.exists(temp_path):
        os.makedirs(temp_path)
    inputpdf = PdfFileReader(open(file_path, "rb"))
    if inputpdf.getIsEncrypted():
        inputpdf.decrypt('')
    for i in xrange(inputpdf.numPages):
        output = PdfFileWriter()
        output.addPage(inputpdf.getPage(i))
        with open(os.path.join(temp_path, '%s.pdf' % i), "wb") as outputStream:
            output.write(outputStream)
It works for small files, but when the PDF has more than 152 pages it only splits out the first 152 pages (0-151) and then stops. It also eats up all the system's memory before I kill it.
Please let me know what I'm doing wrong, where the problem occurs, and how I can correct it.
It seems like the problem is with pyPdf itself. I switched to pyPDF2 and it worked.
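
For reference, a sketch of the same function under PyPDF2 1.x, whose class names mirror pyPdf's (one hypothetical adjustment: os.path.dirname, so the temp directory is created next to the script rather than under the script file's own path):

import os, time
from PyPDF2 import PdfFileReader, PdfFileWriter

def split_pages(file_path):
    print("Splitting the PDF")
    temp_path = os.path.join(os.path.dirname(os.path.abspath(__file__)),
                             "temp_" + str(int(time.time())))
    if not os.path.exists(temp_path):
        os.makedirs(temp_path)
    inputpdf = PdfFileReader(open(file_path, "rb"))
    if inputpdf.getIsEncrypted():
        inputpdf.decrypt('')
    for i in xrange(inputpdf.numPages):  # one PdfFileWriter per page
        output = PdfFileWriter()
        output.addPage(inputpdf.getPage(i))
        with open(os.path.join(temp_path, '%s.pdf' % i), "wb") as outputStream:
            output.write(outputStream)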

Read multilanguage strings from html via Python 2.7

I am new to Python 2.7 and I am trying to extract some info from HTML files. More specifically, I want to read some text that contains multilanguage information. I give my script here, hoping to make things clearer.
import urllib2
import BeautifulSoup
url = 'http://www.bbc.co.uk/zhongwen/simp/'
page = urllib2.urlopen(url).read().decode("utf-8")
dom = BeautifulSoup.BeautifulSoup(page)
data = dom.findAll('meta', {'name' : 'keywords'})
print data[0]['content'].encode("utf-8")
the result I get is
BBCϊ╕φόΨΘύ╜ΣΎ╝Νϊ╕╗ώκ╡Ύ╝Νbbcchinese.com, email news, newsletter, subscription, full text
The problem is in the first string. Is there any way to print exactly what I am reading? Also, is there any way to find the exact encoding of the language of each script?
PS: I would like to mention that the site was selected totally at random, as it is representative of the problem I am encountering.
Thank you in advance!
The problem is with the terminal where you are outputting the result. The script works fine, and if you write the data to a file you will get it correctly.
Example:
import urllib2
from bs4 import BeautifulSoup
url = 'http://www.bbc.co.uk/zhongwen/simp/'
page = urllib2.urlopen(url).read().decode("utf-8")
dom = BeautifulSoup(page)
data = dom.findAll('meta', {'name' : 'keywords'})
with open("test.txt", "w") as myfile:
myfile.write(data[0]['content'].encode("utf-8"))
test.txt:
BBC中文网,主页,bbcchinese.com, email news, newsletter, subscription, full text
Which OS and terminal are you using?
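
As a quick diagnostic, you can check what encoding Python believes your terminal uses; a sketch for Python 2 (the values shown are only examples):

import sys

print sys.stdout.encoding  # e.g. 'UTF-8', or a legacy codepage on older Windows consoles
text = u'BBC\u4e2d\u6587\u7f51'  # u'BBC中文网'
# Encode explicitly for the terminal, replacing characters it cannot display:
print text.encode(sys.stdout.encoding or 'utf-8', 'replace')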

Django - Decoding MIME Header with Base64 and UTF-8

I am creating a web email interface to read IMAP accounts. I'm having problems decoding a certain email header.
I obtain the following From header (specific example from an event email):
('"=?UTF-8?B?QmVubnkgQmVuYXNzaQ==?=" <NOREPLY#NOREPLY.LOCKNLOADEVENTS.COM>', None)
I separate the first part:
=?UTF-8?B?QmVubnkgQmVuYXNzaQ==?=
According to some research, it's apparently a Base64-encoded UTF-8 header.
I tried to decode it using the Base64 decoder:
# Separate sender name from email itself
first_part = header_text[1:header_text.index('" <')]
print "First part:", first_part
import base64
decoded_first_part = base64.urlsafe_b64decode(first_part)
print decoded_first_part
But I obtain a
TypeError: Incorrect padding.
Can anybody help me figure out what's wrong?
Thank you
>>> import base64
>>> base64.decodestring('QmVubnkgQmVuYXNzaQ==')
'Benny Benassi'
But you probably want to use a proper IMAP library for doing this stuff.
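
The "Incorrect padding" error most likely comes from passing the whole encoded word, =?UTF-8?B? prefix and ?= suffix included, to the Base64 decoder instead of just the payload between them. The standard library's email.header module parses that wrapper for you; a minimal sketch (Python 2):

from email.header import decode_header

raw = '=?UTF-8?B?QmVubnkgQmVuYXNzaQ==?= <NOREPLY#NOREPLY.LOCKNLOADEVENTS.COM>'
parts = decode_header(raw)  # list of (bytes, charset) pairs, one per header chunk
sender = u' '.join(
    chunk.decode(charset) if charset else chunk.decode('ascii')
    for chunk, charset in parts
)
print sender  # Benny Benassi <NOREPLY#NOREPLY.LOCKNLOADEVENTS.COM>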

Django csv output in Excel

Hi, I have a simple view which returns a CSV file of a queryset generated from a MySQL db using utf-8 encoding:
def export_csv(request):
    ...
    response = HttpResponse(mimetype='text/csv')
    response['Content-Disposition'] = 'attachment; filename=search_results.csv'
    writer = csv.writer(response, dialect=csv.excel)
    for item in query_set:
        writer.writerow(smart_str(item))
    return response
    return render(request, 'search_results.html', context)
This works fine as a CSV file, and can be opened in text editors, LibreOffice etc. without problem.
However, I need to supply a file which can be opened in MS Excel on Windows without errors. If I have strings with Latin characters in the queryset, such as 'Española', then the output in Excel is 'EspaÃ±ola'.
I tried this blogpost but it didn't help. I also know about the xlwt package, but I am curious whether there is a way of correcting the output using the CSV method I have at the moment.
Any help much appreciated.
Looks like there is not a uniform solution for all versions of Excel.
Your best bet might be to go with openpyxl, but this is rather complicated and requires separate handling of downloads for Excel users, which is not optimal.
Try adding a byte order mark (0xEF, 0xBB, 0xBF) at the beginning of the file. See microsoft-excel-mangles-diacritics-in-csv-files.
There is another similar post.
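
A sketch of that byte-order-mark suggestion applied to the view from the question (query_set and the elided setup are as in the question; only the response.write line is new):

import codecs
import csv

from django.http import HttpResponse
from django.utils.encoding import smart_str

def export_csv(request):
    ...
    response = HttpResponse(mimetype='text/csv')
    response['Content-Disposition'] = 'attachment; filename=search_results.csv'
    response.write(codecs.BOM_UTF8)  # UTF-8 byte order mark so Excel detects the encoding
    writer = csv.writer(response, dialect=csv.excel)
    for item in query_set:
        writer.writerow(smart_str(item))
    return response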
You might give python-unicodecsv a go. It replaces the Python csv module, which doesn't handle Unicode too gracefully.
Put the unicodecsv folder somewhere you can import it, or install it via setup.py.
Import it into your view file, e.g.:
import unicodecsv as csv
I found out there are 3 things to do for Excel to open Unicode CSV files properly:
Use the utf-16-le charset
Insert a utf-16 byte order mark at the beginning of the exported file
Use tabs instead of commas in the csv
So, this should make it work in Python 3.7 and Django 2.2:
import codecs
import csv
from django.http import HttpResponse
from django.utils.encoding import smart_str
...
def export_csv(request):
    ...
    response = HttpResponse(content_type='text/csv', charset='utf-16-le')
    response['Content-Disposition'] = 'attachment; filename=search_results.csv'
    response.write(codecs.BOM_UTF16_LE)
    writer = csv.writer(response, dialect='excel-tab')
    for item in query_set:
        writer.writerow(smart_str(item))
    return response