I am using python2 and I want to convert the non utf-8 text into readable string. I am trying decode using latin-1 and utf-8 also. But I am getting no success.
This is the string
s = ' ¤¿à¤²à¤¾ मेंदान रोड़ इंदौर'
I tried:
s.decode('utf-8')
I am getting following output:
u' \xc3\u201a\xc2\xa4\xc3\u201a\xc2\xbf\xc3\u0192\xc2 \xc3\u201a\xc2\xa4\xc3\u201a\xc2\xb2\xc3\u0192\xc2 \xc3\u201a\xc2\xa4\xc3\u201a\xc2\xbe \xc3\u0192\xc2 \xc3\u201a\xc2\xa4\xc3\u201a\xc2\xae\xc3\u0192\xc2 \xc3\u201a\xc2\xa5\xc3\xa2\xc2\u20ac\xc2\xa1\xc3\u0192\xc2 \xc3\u201a\xc2\xa4\xc3\xa2\xc2\u20ac\xc2\u0161\xc3\u0192\xc2 \xc3\u201a\xc2\xa4\xc3\u201a\xc2\xa6\xc3\u0192\xc2 \xc3\u201a\xc2\xa4\xc3\u201a\xc2\xbe\xc3\u0192\xc2 \xc3\u201a\xc2\xa4\xc3\u201a\xc2\xa8 \xc3\u0192\xc2 \xc3\u201a\xc2\xa4\xc3\u201a\xc2\xb0\xc3\u0192\xc2 \xc3\u201a\xc2\xa5\xc3\xa2\xc2\u20ac\xc2\xb9\xc3\u0192\xc2 \xc3\u201a\xc2\xa5\xc3\u2026\xc2\u201c \xc3\u0192\xc2 \xc3\u201a\xc2\xa4\xc3\xa2\xc2\u20ac\xc2\xa1\xc3\u0192\xc2 \xc3\u201a\xc2\xa4\xc3\xa2\xc2\u20ac\xc2\u0161\xc3\u0192\xc2 \xc3\u201a\xc2\xa4\xc3\u201a\xc2\xa6\xc3\u0192\xc2 \xc3\u201a\xc2\xa5\xc3\u2026\xc2\u2019\xc3\u0192\xc2 \xc3\u201a\xc2\xa4\xc3\u201a\xc2\xb0'
The above output is still non-readable.
Any help would be highly appreciated
First thing first: you MUST read this, no excuse.
Once done, you'll understand that trying to decode a byte string to unicode without knowing the original encoding is mostly a waste of time.
Second point: this (shortened for readability):
u' \xc3\u201a\xc2\xa4\xc3\u201a\xc2\xbf\xc3\u0192\xc2 '
is Python's inner representation of the unicode string - you'd get something similar by displaying the inner representation of your byte string:
$ python
Python 2.7.15rc1 (default, Apr 15 2018, 21:51:34)
[GCC 7.3.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> s = "¤¿à¤²ÃÂ"
>>> s # internal representation
'\xc3\x83\xe2\x80\x9a\xc3\x82\xc2\xa4\xc3\x83\xe2\x80\x9a\xc3\x82\xc2\xbf\xc3\x83\xc6\x92\xc3\x82 \xc3\x83\xe2\x80\x9a\xc3\x82\xc2\xa4\xc3\x83\xe2\x80\x9a\xc3\x82\xc2\xb2\xc3\x83\xc6\x92\xc3\x82'
>>> print(s) # readable output
¤¿à¤²ÃÂ
So your only issue here is confusing the internal representation with the "human-visible" output
Now note that how the string will ultimately be displayed to the user depends on the software doing the rendering (your xterm or equivalent if running python from command line and printing to stdout, your browser if rendering this as part of a server-side generated HTTP response, etc etc) and system settings, all of which are outside Python's responsabilities.
Related
I tried to use both openslide and pyvips and my application doesn't find the necesary .dll. I think it is a problem of using both librarys.
I have read that pyvips has openslide embed but I can't find how to use it. The main purpose for this is to read Whole Slide Images and see the different levels and augmentations, and work with them.
I'd really appreciate your help! Thank you
Yes, pyvips usually includes openslide, so you can't use both together.
Use .get_fields() to see all the metadata on an image, for example:
$ python3
Python 3.9.7 (default, Sep 10 2021, 14:59:43)
[GCC 11.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pyvips
>>> x = pyvips.Image.new_from_file("openslide/CMU-1.svs")
>>> x.width
46000
>>> x.height
32914
>>> x.get_fields()
['width', 'height', 'bands', 'format', 'coding', 'interpretation', 'xoffset', 'yoffset',
'xres', 'yres', 'filename', 'vips-loader', 'slide-level', 'aperio.AppMag', 'aperio.Date',
'aperio.Filename', 'aperio.Filtered', 'aperio.Focus Offset', 'aperio.ICC Profile',
'aperio.ImageID', 'aperio.Left', 'aperio.LineAreaXOffset', 'aperio.LineAreaYOffset',
...
pyvips will open base level of the image by default (the largest), use level= to pick other levels, perhaps:
>>> x = pyvips.Image.new_from_file("openslide/CMU-1.svs", level=2)
>>> x.width
2875
See the docs for details:
https://www.libvips.org/API/current/VipsForeignSave.html#vips-openslideload
I would like to decode MACCYRILLIC code, for example "%EE%F2_%E4%EE%E1%F0%E0_%E4%EE%E1%F0%E0_%ED%E5_%E8%F9%F3%F2". How can I do it using Python2?
phrase.decode("MACCYRILLIC") has no effect.
urllib — Open arbitrary resources by URL
urllib.unquote(string)
Replace %xx escapes by their single-character equivalent.
Example: unquote('/%7Econnolly/') yields '/~connolly/'.
==> py -2
Python 2.7.12 (v2.7.12:d33e0cf91556, Jun 27 2016, 15:24:40) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>>
>>> import urllib
>>> MACCYRILLIC = "%EE%F2_%E4%EE%E1%F0%E0_%E4%EE%E1%F0%E0_%ED%E5_%E8%F9%F3%F2"
>>> print urllib.unquote(MACCYRILLIC).decode('cp1251')
от_добра_добра_не_ищут
>>>
Edit. Another approach (step by step):
==> py -2
Python 2.7.12 (v2.7.12:d33e0cf91556, Jun 27 2016, 15:24:40) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib
>>> import codecs
>>> MACCYRILLIC = '%EE%F2_%E4%EE%E1%F0%E0_%E4%EE%E1%F0%E0_%ED%E5_%E8%F9%F3%F2'
>>> #print
... x = urllib.unquote(MACCYRILLIC) #.decode('cp1251')
>>> print repr(x)
'\xee\xf2_\xe4\xee\xe1\xf0\xe0_\xe4\xee\xe1\xf0\xe0_\xed\xe5_\xe8\xf9\xf3\xf2'
>>> y = codecs.decode(x, 'cp1251')
>>> print y
от_добра_добра_не_ищут
>>>
All above would work on the following requirement:
>>> import sys
>>> sys.stdout.encoding
'utf-8'
>>> print sys.stdout.encoding
utf-8
>>>
Unfortunately, the example in http://rextester.com/XAX79891 shows sys.stdout.encoding None (and I don't know a way of changing it to utf-8). Read more in Lennart Regebro's answer to Stdout encoding in python:
A better generic solution under Python 2 is to treat stdout as what
it is: An 8-bit interface. And that means that anything you print to
to stdout should be 8-bit. You get the error when you are trying to
print Unicode data, because print will then try to encode the Unicode
data to the encoding of stdout, and if it's None it will assume
ASCII, and fail, unless you set PYTHONIOENCODING.
I'm a bit lost on how to extract coordinates (Lat, Long) from a URL in Python.
Always I'll recive a url like this:
https://www.testweb.com/cordi?ll=41.403781,2.1896&z=17&pll=41.403781,2.1896
Where I need to extract the second set of this URL (in this case: 41.403781,2.1896) Just to say, that not always the first and second set of coords will be the same.
I know, that can be done with some regex, but I'm not good enough on it.
Here's how to do it with a regular expression:
import re
m = re.search(r'pll=(\d+\.\d+),(\d+\.\d+)', 'https://www.testweb.com/cordi?ll=41.403781,2.1896&z=17&pll=41.403781,2.1896')
print m.groups()
Result: ('41.403781', '2.1896')
You might want look at the module urlparse for a more robust solution.
urlparse has a functions "urlparse" and "parse_qs" for accessing this data reliably, as shown below
$ python
Python 2.6.6 (r266:84292, Jul 23 2015, 15:22:56)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-11)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> u="""https://www.testweb.com/cordi?ll=41.403781,2.1896&z=17&pll=41.403781,2.1896"""
>>> import urlparse
>>> x=urlparse.urlparse(u)
>>> x
ParseResult(scheme='https', netloc='www.testweb.com', path='/cordi', params='', query='ll=41.403781,2.1896&z=17&pll=41.403781,2.1896', fragment='')
>>> x.query
'll=41.403781,2.1896&z=17&pll=41.403781,2.1896'
>>> urlparse.parse_qs(x.query)
{'ll': ['41.403781,2.1896'], 'z': ['17'], 'pll': ['41.403781,2.1896']}
>>>
First of all, I've looked here: Sublime Text 3, Python 3 and UTF-8 don't like each other and read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets but am still none the wiser to the following:
Running Python from a file created in Sublime (not compiling) and executing via command prompt on an XP machine
I have a couple of text files named with accents (German, Spanish & French mostly). I want to remove accented characters (umlauts, acutes, graves, cidillas etc) and replace them with their equilivant non accented look a like.
I can strip the accents if they are a string from with the script. But accesing a textfile of the same name causes the the strippAcent function to fail. I'm all out of ideas as I think this is due to a conflict with Sublime and Python.
Here's my script
# -*- coding: utf-8 -*-
import unicodedata
import os
def stripAccents(s):
try:
us = unicode(s,"utf-8")
nice = unicodedata.normalize("NFD", us).encode("ascii", "ignore")
print nice
return nice
except:
print ("Fail! : %s" %(s))
return None
stripAccents("Découvrez tous les logiciels à télécharger")
# Decouvrez tous les logiciels a telecharger
stripAccents("Östblocket")
# Ostblocket
stripAccents("Blühende Landschaften")
# Bluhende Landschaften
root = "D:\\temp\\test\\"
for path, subdirs, files in os.walk(root):
for name in files:
x = name
x = stripAccents(x)
For the record:
C:\chcp
gets me 437
This is what the code produces for me:
The error in full is:
C:\WINDOWS\system32>D:\LearnPython\unicode_accents.py
Decouvrez tous les logiciels a telecharger
Ostblocket
Bluhende Landschaften
Traceback (most recent call last):
File "D:\LearnPython\unicode_accents.py", line 37, in <module>
x = stripAccents(x)
File "D:\LearnPython\unicode_accents.py", line 8, in stripAccents
us = unicode(s,"utf-8")
UnicodeDecodeError: 'utf8' codec can't decode byte 0xfc in position 2: invalid start byte
C:\WINDOWS\system32>
root = "D:\\temp\\test\\"
for path, subdirs, files in os.walk(root):
If you want to read Windows's filenames in their native Unicode form you have to ask for that specfically, by passing a Unicode string to filesystem functions:
root = u"D:\\temp\\test\\"
Otherwise Python will default to using the standard byte-based interfaces to the filesystem. On Windows, these return filenames to you encoded in the system's locale-specific legacy encoding (ANSI code page).
In stripAccents you try to decode the byte string you got from here using UTF-8, but the ANSI code page is never UTF-8, and the byte sequence you have doesn't happen to be a valid UTF-8 sequence so you get an error. You can decode from the ANSI code page using the pseudo-encoding mbcs, but it would be better to stick to Unicode filepath strings so you can include characters that don't fit in ANSI.
Always use Unicode strings to represent text in Python. Add from __future__ import unicode_literals at the top so that all "" literals would create Unicode strings. Or use u"" literals everywhere. Drop unicode(s, 'utf-8') from stripAccents(), always pass Unicode strings instead (try unidecode package, to transliterate Unicode to ascii).
Using Unicode solves several issues transparently:
there won't be UnicodeDecodeError because Windows provides Unicode API for filenames: if you pass Unicode input; you get Unicode output
you won't get a mojibake when a bytestring containing text encoded using your Windows encoding such as cp1252 is displayed in console using cp437 encoding e.g., Blühende -> Blⁿhende (ü is corrupted)
you might be able to work with text that can't be represented using neither cp1252 nor cp437 encoding e.g., '❤' (U+2764 HEAVY BLACK HEART).
To print Unicode text to Windows console, you could use win-unicode-console package.
The "raw_input" is generally used for taking a respond of prompt to a string, and then it can be also assigned(=) by a variable.
But I found something strange(for me) and can't well understand. In a mistake incident case(see example below), I put a equal(==) for assigning variable in a function. Then I run it in interpreter and didn't have any error. I through my script was well done but it could not work as my expert.
My questions:
I would like to know this usage of "raw_input" in Python is correct?
If yes, how we use it?
If not, why don't interpreter give us a error warning?
Thank you so much.
Example:
Python 2.7.3 (default, Apr 24 2012, 00:00:54)
[GCC 4.7.0 20120414 (prerelease)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> v = "Variable"
>>> def test():
... v == raw_input(">")
... print v
...
>>> test()
>Hello!
Variable
>>>
This:
v == raw_input(">")
is simply a comparison. You get True or False as a result and then throw it away, because you don't give it a name. You could write
comparison = v == raw_input(">")
print comparison
to see the value.