Just another question about encoding in Python, I think. I have this program:
regex = re.compile(ur'\b[sw]\w+', flags=re.U | re.I)
ergebnisliste = []
for line in fileobject:
    print str(line)
    erg = regex.findall(line)
    ergebnisliste = ergebnisliste + erg
ergebnislistesortiert = sorted(ergebnisliste, key=lambda x: len(x))
print ergebnislistesortiert
fileobject.close()
I am searching a textfile for words beginning with s or w. My "ergebnislistesortiert" is the sorted result list.
When I print the result list, there appears to be a problem with the encoding:
['so', 'Wer', 'sp\xc3']
The 'sp\xc3' should be printed as spät. What is wrong here? Why is the list element UTF-8?
And how can I get the right decoding to print "spät"?
Thanks a lot guys!
\xc3 on its own is not UTF-8. It is the first byte of the two-byte UTF-8 encoding of U+00E4 (ä), which is \xc3\xa4, but you're probably reading it with something like a Latin-1 decoder (which is effectively what Python 2 does if you read bytes without specifying an encoding), in which case the second byte in the UTF-8 sequence isn't matched by \w.
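The effect can be reproduced in a few lines. This is only a sketch, assuming Python 3 (where bytes have to be decoded explicitly) and using a Latin-1 decode to imitate the byte-per-character view described above; the sample sentence is made up.

```python
import re

# Same pattern as in the question; in Python 3 str patterns are Unicode-aware.
pattern = re.compile(r'\b[sw]\w+', flags=re.I)

# The bytes as they would sit in a UTF-8 encoded file (hypothetical sample text).
raw = 'so Wer spät'.encode('utf-8')

# Imitates the problem: every byte is treated as one character.
wrong = raw.decode('latin-1')
# The fix: the two bytes C3 A4 are decoded into the single character 'ä'.
right = raw.decode('utf-8')

wrong_matches = pattern.findall(wrong)
right_matches = pattern.findall(right)

print(wrong_matches)
print(right_matches)
```

In the mis-decoded string the byte A4 becomes the currency sign, which is not a word character, so the match stops one byte into the two-byte sequence.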
The real fix is to decode the data when you are reading it into Python in the first place. If you are writing new code, switching to Python 3 is probably the best and easiest fix.
If you're stuck on Python 2.7, a somewhat Python 3-compatible approach is something like
import io
fileobject = io.open(filename, encoding='utf-8')
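Put together with the loop from the question, this could look roughly as follows. It is a sketch rather than a drop-in replacement: the function name find_words and the temporary sample file are made up for illustration, and the input is assumed to really be UTF-8.

```python
import io
import os
import re
import tempfile


def find_words(path, encoding='utf-8'):
    """Return all words starting with s or w, shortest first."""
    pattern = re.compile(r'\b[sw]\w+', flags=re.U | re.I)
    results = []
    # io.open decodes while reading, so each line is already Unicode text.
    with io.open(path, encoding=encoding) as fileobject:
        for line in fileobject:
            results.extend(pattern.findall(line))
    return sorted(results, key=len)


# Hypothetical sample file, written as raw UTF-8 bytes.
with tempfile.NamedTemporaryFile('wb', suffix='.txt', delete=False) as tmp:
    tmp.write(u'Wer so sp\u00e4t\n'.encode('utf-8'))
    sample_path = tmp.name

words = find_words(sample_path)
os.remove(sample_path)
print(words)
```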
If you have control over the input file and want to postpone the proper solution until you are older, (ask your parents for permission to) convert the UTF-8 input file to some legacy 8-bit encoding.
I am trying to build a tool that can convert .csv files into .yaml files for further use. I found a handy bit of code that does the job nicely from the link below:
Convert CSV to YAML, with Unicode?
which states that the line will take the dict created by opening a .csv file and dump it to a .yaml file:
out_file.write(ry.safe_dump(dict_example,allow_unicode=True))
However, one small kink I have noticed is that when it is run once, the generated .yaml file is typically incomplete by a line or two. In order to have the .csv file exhaustively read through to create a complete .yaml file, the code must be run two or even three times. Does anybody know why this could be?
UPDATE
Per request, here is the code I use to parse my .csv file, which is two columns long (with a string in the first column and a list of two strings in the second column), and will typically be 50 rows long (or maybe more). Also note that it is designed to remove any '\n' or spaces that could potentially cause problems later on in the code.
csv_contents = {}
with open("example1.csv", "rU") as csvfile:
    green = csv.reader(csvfile, dialect='excel')
    for line in green:
        candidate_number = line[0]
        first_sequence = line[1].replace(' ','').replace('\r','').replace('\n','')
        second_sequence = line[2].replace(' ','').replace('\r','').replace('\n','')
        csv_contents[candidate_number] = [first_sequence, second_sequence]
csv_contents.pop('Header name', None)
Ultimately, it is not that important that I maintain the order of the rows from the original dict, just that all the information within the rows is properly structured.
I am not sure what the cause could be, but you might be running out of memory, since you create the YAML document in memory first and then write it out. It is much better to stream it out directly.
You should also note that the code in the question you link to, doesn't preserve the order of the original columns, something easily circumvented by using round_trip_dump instead of safe_dump.
You probably want to make a top-level sequence (list) as in the desired output of the linked question, with each element being a mapping (dict).
The following parses the CSV, taking the first line as keys for mappings created for each following line:
import sys
import csv
import ruamel.yaml as ry
import dateutil.parser # pip install python-dateutil
def process_line(line):
    """convert lines, trying, int, float, date"""
    ret_val = []
    for elem in line:
        try:
            res = int(elem)
            ret_val.append(res)
            continue
        except ValueError:
            pass
        try:
            res = float(elem)
            ret_val.append(res)
            continue
        except ValueError:
            pass
        try:
            res = dateutil.parser.parse(elem)
            ret_val.append(res)
            continue
        except ValueError:
            pass
        ret_val.append(elem.strip())
    return ret_val
csv_file_name = 'xyz.csv'
data = []
header = None
with open(csv_file_name) as inf:
    for line in csv.reader(inf):
        d = process_line(line)
        if header is None:
            header = d
            continue
        data.append(ry.comments.CommentedMap(zip(header, d)))
ry.round_trip_dump(data, sys.stdout, allow_unicode=True)
with input xyz.csv:
id, title_english, title_russian
1, A Title in English, Название на русском
2, Another Title, Другой Название
this generates:
- id: 1
  title_english: A Title in English
  title_russian: Название на русском
- id: 2
  title_english: Another Title
  title_russian: Другой Название
The process_line function is just some sugar that tries to convert strings in the CSV file to more useful types and to strings without leading spaces (resulting in far fewer quotes in your output YAML file).
I have tested the above on files with 1000 rows, without any problems (I won't post the output though).
The above was done using Python 3 as well as Python 2.7, starting with a UTF-8 encoded file xyz.csv. If you are using Python 2, you can try unicodecsv if you need to handle Unicode input and things don't work out as well as they did for me.
Hi guys I am having a problem with inserting utf-8 unicode character to my database.
The unicode that I get from my form is u'AJDUK MARKO\u010d'. The next step is to encode it to UTF-8 with value.encode('utf-8'), which gives me the string 'AJDUK MARKO\xc4\x8d'.
Then I try to update the database (an insert behaves the same way, by the way):
cur.execute( "UPDATE res_partner set %s = '%s' where id = %s;"%(columns, value, remote_partner_id))
The value gets inserted or updated to the database but the problem is it is exactly in the same format as AJDUK MARKO\xc4\x8d and of course I want AJDUK MARKOČ. Database has utf-8 encoding so it is not that.
What am I doing wrong? Surprisingly couldn't really find anything useful on the forums.
\xc4\x8d is the UTF-8 encoding of č (U+010D, the lowercase form of Č). It looks like the insert has worked but you're not printing the result correctly, probably by printing the whole row as a list. I.e.
>>> print "Č"
Č
>>> print ["Č"] # a list with one string
['\xc4\x8c']
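The same distinction between a value and its repr can be sketched in Python 3, where it shows up for bytes objects rather than for str. This is only an illustration of the display effect, not of what the database contains:

```python
# Č (U+010C) as text, and as the two UTF-8 bytes C4 8C.
text = u'\u010c'
data = text.encode('utf-8')

# Printing a container shows the repr of each element, escapes included.
shown_in_list = repr([data])
# Decoding gives back the original character; nothing was lost.
round_trip = data.decode('utf-8')

print(shown_in_list)
print(round_trip == text)
```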
We need to see more code to validate (It's always a good idea to give as much reproducible code as possible).
You could decode the result (result.decode("utf-8")), but you should avoid manually encoding or decoding. Psycopg2 already allows you to send unicode strings, so you can do the following without encoding first:
cur.execute( u"UPDATE res_partner set %s = '%s' where id = %s;" % (columns, value, remote_partner_id))
- note the leading u
Psycopg2 can return Unicodes too by having strings automatically decoded:
import psycopg2
import psycopg2.extensions
psycopg2.extensions.register_type(psycopg2.extensions.UNICODE)
psycopg2.extensions.register_type(psycopg2.extensions.UNICODEARRAY)
Edit:
SQL values should be passed as arguments to .execute() instead of being formatted into the query string. See the big red box at: http://initd.org/psycopg/docs/usage.html#the-problem-with-the-query-parameters
E.g.
# Replace the columns field first.
# Strictly we should use http://initd.org/psycopg/docs/sql.html#module-psycopg2.sql
sql = u"UPDATE res_partner set {} = %s where id = %s;".format(columns)
cur.execute(sql, (value, remote_partner_id))
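For a version that runs without a PostgreSQL server, here is a sketch of the same idea using the standard library's sqlite3 module as a stand-in for psycopg2. Note the assumptions: sqlite3 uses ? placeholders where psycopg2 uses %s, and the update_partner helper, the ALLOWED_COLUMNS whitelist and the table layout are made up for illustration.

```python
import sqlite3

# Hypothetical whitelist: column names cannot be passed as query parameters,
# so they are checked against known names before being formatted in.
ALLOWED_COLUMNS = {'name', 'city'}


def update_partner(conn, column, value, partner_id):
    if column not in ALLOWED_COLUMNS:
        raise ValueError('unexpected column: %r' % (column,))
    # Only the identifier is formatted in; the values travel as parameters.
    sql = u'UPDATE res_partner SET {} = ? WHERE id = ?'.format(column)
    conn.execute(sql, (value, partner_id))


conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE res_partner (id INTEGER PRIMARY KEY, name TEXT, city TEXT)')
conn.execute('INSERT INTO res_partner (id, name) VALUES (?, ?)', (1, u'old'))

# The Unicode value is passed as-is, with no manual encode() step.
update_partner(conn, 'name', u'AJDUK MARKO\u010d', 1)
stored = conn.execute(
    'SELECT name FROM res_partner WHERE id = ?', (1,)).fetchone()[0]
print(stored)
```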
I'm running a parser in Python 2.7 that takes a text field of XML from a database and uses Beautiful Soup to find and pull different tags out of the XML. When I pull the name out of an author tag in the XML, such as
<author>
    <name>Josef Šimánek</name>
</author>
the text it returns is
Josef \xc5\xa0im\xc3\xa1nek
when what it should look like is
Josef Šimánek
my relevant code is as follows:
rss = str(f)
soup = BeautifulSoup(rss)
entries = soup.findAll('entry')
for entry in entries:
    author = entry.find('author')
    if author != None:
        for name in author.findAll("name"):
            if(checkNull(name).find(",") != -1):
                name = checkNull(name).split(",",1)
                for s in name:
                    print s
            else:
                print name
As you can see the code pulls out and cycles through the different tags and if the name tag contains a comma separated list of names, then it splits and prints each individually.
def checkNull(item):
    if item != None:
        return item.text.rstrip()
    return " "
Also the check null function is just a helper method to see if the returned tag even contains any text at all as seen above.
I have tried the encode, decode, and unicode functions to try to resolve the issue, but none have succeeded. Are there any other methods I could try to fix this?
name is a BeautifulSoup.Tag type, not a string, so you're probably getting a __repr__ of the object that's suitable for a terminal that doesn't support UTF-8 (\xc5\xa0 is the Python byte escape sequence for the UTF-8 encoding of Š). name.text is probably the value you actually want, which should be a Unicode string.
If you're using Windows, it's best to avoid printing to the console as its console doesn't easily support UTF-8. You could use https://pypi.python.org/pypi/win_unicode_console, but it's easier to just write your output to a file instead.
I've cleaned up your code a little to make it simpler (quick null checks) and to write your output to a UTF-8 encoded file:
# io provides better access to files with working universal newline support
import io
# open a file in text mode, encoding all output to utf-8
output_file = io.open("output.txt", "w", encoding="utf-8")
rss = str(f)
soup = BeautifulSoup(rss)
entries = soup.findAll('entry')
for entry in entries:
    author = entry.find('author')
    # If not null or not empty
    if author:
        for name in author.findAll("name"):
            # .text contains the actual Unicode string value
            if name.text:
                names = name.text.split(",", 1)
                # If string contained a comma, you'll have two elements in a list
                # else you'll just have the 1 length list
                for flname in names:
                    # remove any whitespace on either side
                    output_file.write(flname.strip() + "\n")
output_file.close()
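If you want to try the split-and-write logic without Beautiful Soup installed, the following sketch swaps in the standard library's xml.etree.ElementTree as the parser, which is not what the code above uses. It assumes Python 3, and the sample feed, the author_names helper and the temporary output path are made up for illustration.

```python
import io
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical feed with one plain name, one comma-separated pair, no author.
feed = u"""<feed>
  <entry><author><name>Josef Šimánek</name></author></entry>
  <entry><author><name>Ada Lovelace, Charles Babbage</name></author></entry>
  <entry><title>no author here</title></entry>
</feed>"""


def author_names(xml_text):
    """Collect author names, splitting 'a, b' into two entries."""
    names = []
    root = ET.fromstring(xml_text)
    for entry in root.iter('entry'):
        for name in entry.findall('author/name'):
            # .text is the Unicode string inside the tag (None if empty).
            if name.text:
                for flname in name.text.split(',', 1):
                    names.append(flname.strip())
    return names


names = author_names(feed)

# Write as UTF-8 text, then read the raw bytes back to see the encoding.
out_path = os.path.join(tempfile.mkdtemp(), 'output.txt')
with io.open(out_path, 'w', encoding='utf-8') as output_file:
    for flname in names:
        output_file.write(flname + u'\n')
with io.open(out_path, 'rb') as check:
    raw = check.read()

print(names)
```

The bytes \xc5\xa0 from the question appear in the raw file contents, which is expected: they are the UTF-8 form of Š, and any UTF-8 aware editor will display the name correctly.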
I have a large text file which mainly consists of numbers and some delimiters like ,|{}[]: etc. I used Lempel-Ziv encoding for compression. The code I used is not mine and is the one from Rosetta code. I ran the code for line by line compression as well as once for chunk by chunk compression:
def readChunk(file_object, size = 1024):
    while True:
        data = file_object.read(size)
        if not data:
            break
        yield data
def readByChunk():
    with open(LARGE_FILE, 'r') as f:
        for data in readChunk(f, 2048):
            compressed_chunk = compress(data)
            compressed_chunk = map(lambda a : str(a), compressed_chunk)
            comp_file.write(" ".join(compressed_chunk))
def readLineByLine():
    with open(LARGE_FILE, 'r') as f:
        lines = f.readlines()
        for data in lines:
            compressed_line = compress(data)
            compressed_line = map(lambda a : str(a), compressed_line)
            comp_file.write(" ".join(compressed_line))
Both functions output a file that is much bigger than the original file! Decompression works fine, i.e. I am able to get the original text back, so I think the code is correct.
Am I doing something wrong in saving the file?
The compressor you are using is terrible. Try zlib.compress instead.
The general answer is "when the data is random bits", or already compressed. 99% of other normal things will compress just fine. For ASCII data (like the data you say you are using) really trivial compressors suffice; just Huffman encoding it gets you a decent boost, and you say you only use about a dozen unique characters.
Which means that either you have a bunch of random data that you're not telling us about or there's a bug in the compressor.
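As a rough check of the zlib suggestion, here is a sketch that compresses made-up data of the kind described in the question (numbers plus a few delimiters), once in a single call and once chunk by chunk. The sample data and the compress_in_chunks helper are assumptions for illustration, and the actual ratio will depend on your file.

```python
import zlib

# Hypothetical stand-in for the large file: numbers and a few delimiters.
sample = "".join(
    "{%d:[%d,%d]}|" % (i, (i * 7) % 1000, (i * 13) % 1000)
    for i in range(5000)
).encode("ascii")


def compress_in_chunks(data, chunk_size=2048):
    """Feed the data to one compressor object a chunk at a time."""
    compressor = zlib.compressobj()
    pieces = []
    for start in range(0, len(data), chunk_size):
        pieces.append(compressor.compress(data[start:start + chunk_size]))
    # flush() emits whatever the compressor is still holding back.
    pieces.append(compressor.flush())
    return b"".join(pieces)


one_shot = zlib.compress(sample)
streamed = compress_in_chunks(sample)

print(len(sample), len(one_shot), len(streamed))
```

The compressed result is bytes, so it should be written to a file opened in binary mode ('wb') rather than converted to text first.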
I am writing a Bittorrent client. One of the steps involved requires that the program sends a HTTP GET request to the tracker containing an SHA1 hash of part of the torrent file. I have used Fiddler2 to intercept the request sent by Azureus to the tracker.
The hash that Azureus sends is URL-Encoded and looks like this: %D9%0C%3C%E3%94%18%F0%C5%D9%83X%E03I%26%2B%60%8C%BFR
The hash should look like this before it's URL-Encoded: d90c3ce39418f0c5d98358e03349262b608cbf52
I notice that it is not as simple as placing a '%' symbol every two characters, so how would I go about encoding this byte string to get the same result as Azureus?
Thanks in advance.
Actually, you can just place a % symbol every two characters. Azureus doesn't do that because, for example, R is a safe character in a URL, and 52 is the hexadecimal representation of R, so it doesn't need to percent-encode it. Using %52 instead is equivalent.
Go through the string from left to right. If you encounter a %, output the next two characters, converting upper-case to lower-case. If you encounter anything else, output the ASCII code for that character in hex using lower-case letters.
%D9 %0C %3C %E3 %94 %18 %F0 %C5 %D9 %83 X %E0 3 I %26 %2B %60 %8C %BF R
The ASCII code for X is 0x58, so that becomes 58. The ASCII code for 3 is 0x33.
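The left-to-right walk described above can be sketched in Python as follows, together with the "a % every two characters" direction. The function names are made up for illustration, and the input is assumed to be well formed (every % is followed by two hex digits).

```python
def url_encoded_to_hex(encoded):
    """Turn a percent-encoded byte string into lower-case hex."""
    out = []
    i = 0
    while i < len(encoded):
        if encoded[i] == "%":
            # Take the two hex digits after the %, lower-cased.
            out.append(encoded[i + 1:i + 3].lower())
            i += 3
        else:
            # Any other character stands for itself: emit its code in hex.
            out.append("%02x" % ord(encoded[i]))
            i += 1
    return "".join(out)


def hex_to_fully_encoded(hex_str):
    """Place a % before every pair of hex digits (every byte escaped)."""
    pairs = [hex_str[i:i + 2] for i in range(0, len(hex_str), 2)]
    return "".join("%" + pair.upper() for pair in pairs)


azureus = "%D9%0C%3C%E3%94%18%F0%C5%D9%83X%E03I%26%2B%60%8C%BFR"
as_hex = url_encoded_to_hex(azureus)
fully_encoded = hex_to_fully_encoded(as_hex)

print(as_hex)
print(fully_encoded)
```

Both encoded forms decode to the same 20 bytes; the second one simply escapes X, 3, I and R as well.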
(I'm kind of puzzled why you had to ask though. Your question clearly shows that you recognized this as URL-Encoded.)
Even though I know well the original question was about C++, it might be useful somehow, sometimes to see alternative solutions. Therefore, for what it's worth (10 years later), here's
An alternative solution implemented in Python 3.6+
import binascii
import urllib.parse
def hex_str_to_esc_str(s: str, *, encoding: str='Windows-1252') -> str:
    # decode hex string as a Windows-1252 string
    # (Windows-1252 leaves a few byte values, e.g. 0x81, undefined, so this
    # works for the hash below but can raise UnicodeDecodeError for others)
    win1252_str = binascii.unhexlify(s).decode(encoding)
    # escape string and return
    return urllib.parse.quote(win1252_str, encoding=encoding)
def esc_str_to_hex_str(s: str, *, encoding: str='Windows-1252') -> str:
    # unescape the escaped string as a Windows-1252 string
    win1252_str = urllib.parse.unquote(s, encoding=encoding)
    # encode string, hexlify, and return
    return win1252_str.encode(encoding).hex()
Two elementary tests:
esc_str = '%D9%0C%3C%E3%94%18%F0%C5%D9%83X%E03I%26%2B%60%8C%BFR'
hex_str = 'd90c3ce39418f0c5d98358e03349262b608cbf52'
print(hex_str_to_esc_str(hex_str) == esc_str) # True
print(esc_str_to_hex_str(esc_str) == hex_str) # True
Note
Windows-1252 (aka cp1252) emerged as the default encoding as a result of the following test:
import binascii
import chardet
esc_str = '%D9%0C%3C%E3%94%18%F0%C5%D9%83X%E03I%26%2B%60%8C%BFR'
hex_str = 'd90c3ce39418f0c5d98358e03349262b608cbf52'
print(
    chardet.detect(
        binascii.unhexlify(hex_str)
    )
)
...which gave a pretty strong clue:
{'encoding': 'Windows-1252', 'confidence': 0.73, 'language': ''}
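Since a SHA1 hash is raw bytes rather than text, another option is to skip the text encoding altogether. The sketch below is a different technique from the one above: it uses urllib.parse.quote_from_bytes and unquote_to_bytes from the standard library, so no codec has to be guessed and every byte value is accepted. The function names are made up for illustration.

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes


def hex_to_escaped(hex_str):
    # bytes.fromhex gives the raw 20 bytes; safe='' also escapes '/'.
    return quote_from_bytes(bytes.fromhex(hex_str), safe='')


def escaped_to_hex(esc_str):
    # unquote_to_bytes returns bytes, so no decoding step is involved.
    return unquote_to_bytes(esc_str).hex()


esc_str = '%D9%0C%3C%E3%94%18%F0%C5%D9%83X%E03I%26%2B%60%8C%BFR'
hex_str = 'd90c3ce39418f0c5d98358e03349262b608cbf52'

print(hex_to_escaped(hex_str) == esc_str)
print(escaped_to_hex(esc_str) == hex_str)
```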