Python 2.7: Handling Unicode Objects - python-2.7

I have an application that needs to be able to handle non-ASCII characters of unknown encoding. The program may delete or replace these characters (if they are discovered in a user dictionary file); otherwise they need to pass through unaltered. What's mind-boggling is that it works one minute, then I make some seemingly trivial change, and it fails with a UnicodeDecodeError, UnicodeEncodeError, or kindred error. Addressing this has led me down the road of cargo-cult programming: making random tweaks that get it working again without my having any idea why. Is there a general-purpose solution for dealing with this, perhaps even the creation of a class that modifies the normal way Python deals with strings?
I'm not sure what code to include as about five separate modules are involved. Here is what I am doing in abstract terms:
Taking text from one of two sources: text that the user has pasted directly into a Tkinter toplevel window; text captured from the Win32 clipboard via a hotkey command.
The text is processed, including the removal of whitespace characters; then certain characters/words are replaced or simply deleted based on a customizable user dictionary.
The result is then returned to the Tkinter GUI or the Win32 clipboard, depending on whether or not the keyboard shortcut was used.
Some details that may be relevant:
All modules use
# -*- coding: utf-8 -*-
The user dictionary is saved in UTF-16 LE with BOM (a function removes BOM characters when parsing the file). The file object is instantiated with
self.pf = codecs.open(self.pattern_fn, 'r', 'utf-16')
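# (Note: Python's 'utf-16' codec reads the leading BOM itself and uses it to pick the byte order.)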
The entry points for text are via a Tkinter GUI Text widget:
text = self.paste_to_field.get(1.0, Tkinter.END)
Or from the clipboard:
text = win32clipboard.GetClipboardData(win32clipboard.CF_UNICODETEXT)
An example error:
File "C:\Python27\lib\encodings\cp437.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u201d' in position 2: character maps to <undefined>
Furthermore, the same text might work when tested on OS X (where I do development work) but cause an error on Windows.
Regular expressions are used; however, in this case no non-ASCII characters are included in the pattern. For non-ASCII characters I simply
text = text.replace(old, new)
Another thing to consider: for c in text-style iterations are no good because a single non-ASCII character may look like several characters to Python. The normal word/character distinction no longer holds. Also, using bad_letter = repr(non_ASCII) doesn't help, since str(bad_letter) merely returns a string of the escape sequence; it can't restore the original character.
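For illustration, here is the kind of thing I mean (the character is the same U+201D from the traceback above): iterating over a UTF-8 byte string walks over individual bytes, while iterating over a unicode object walks over characters.
s = '\xe2\x80\x9d'       # byte string: the UTF-8 encoding of u'\u201d'
u = s.decode('utf-8')    # unicode object
print len(s), len(u)     # 3 1
print [c for c in s]     # ['\xe2', '\x80', '\x9d'] -- three separate "characters"
print [c for c in u]     # [u'\u201d'] -- one character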
Sorry if this is extremely vague. Please let me know what info I can provide to help clarify. Thanks in advance for reading this.

Related

Is it possible to make an index search by regex in PDF?

I want to search for all lines that match this regex
^([0-9IVX]\.)*.*\R
and report with the page number they are at. The output would be something like:
1. Heading/page number
1.1 Subheading/page number
1.1.1. Subsubheading/page number
Is this possible to do in PDF? I suppose that would require Ghostscript, but searching the How to Use Ghostscript page for regex I find nothing.
I can't think why you would expect Ghostscript to do the searching for you.
I'm not sure if you are hoping to get the data type 'heading, page number' etc from the PDF file, or if you are going to work that out yourself based on the data you find.
If it's the former then the first problem is that, in general, PDF files don't have the kind of structure information you are looking for. There is nothing in most PDF files which says 'this is a heading', 'this is a page number' etc.
There are such things as 'tagged PDF' which adds non-printing elements to a PDF file which do carry that kind of data around with them. This is an entirely optional feature, the vast majority of PDF files don't contain it, and Ghostscript completely ignores it.
Since most PDF files don't have that information, you can't rely on it, unless you are in the happy position of knowing where your PDF files are being generated and that they contain this kind of information. In which case there are numerous tools around which will extract it for you, or enable you to write code to do so.
The problem with just searching for the text is that firstly the text need not be written as a contiguous stream. So if you are looking for '1.1' that might be written as:
(1.1) Tj
or
(1) Tj
(.) Tj
(1) Tj
or
[(1) -0.1 (.) 0.1 (1)] TJ
or any combination of those. The individual character codes need not even appear in order or in the same content stream.
Secondly, the character code in a PDF content stream need not be (and often is not) a Unicode code point, or ASCII, or any other standard coding scheme; it can be totally arbitrary.
Some PDF files carry a ToUnicode CMap around which maps the character codes to Unicode code points, but not all do. Some fonts may use a standard (that's PDF standard) Encoding, in which case it's possible to infer the Unicode code points. Some Encodings may contain glyph names, from which it's again possible to infer Unicode code points.
In the end though, some PDF files are simply impossible to extract text from without using OCR.
Your best bet is probably to write code to extract text, and Ghostscript will do that. It even goes through the hierarchy of fallbacks listed above to try to find a Unicode code point. If all else fails it just uses the character code and hopes that's good enough.
If you use Ghostscript's txtwrite device it will produce either a faked up text page (the default) which attempts, as far as possible, to mimic the text layout in the original PDF file, including merging bits of text that aren't contiguous in the PDF file but are next to each other on the page. Or an 'XML-like' output which will tell you which Unicode code points, or character codes, were encountered and what their position is on the original page. If you don't like txtwrite's attempts to figure out which text goes with what, then you can use this to write your own.
I suspect the text page is probably good enough for your purposes. You can have the txtwrite device produce one file per page, so you can get the page number from the filename. Then you can write your own regex expression(s) to search the files and find your matches.
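As a rough sketch of that workflow in Python (the Ghostscript executable name, filenames and pattern are placeholders; on Windows the executable is usually gswin64c rather than gs, and the regex is loosely adapted from the question since Python's re module has no \R):
import glob
import re
import subprocess

# One text file per page: %03d in the output name makes txtwrite write page_001.txt, page_002.txt, ...
subprocess.check_call(['gs', '-sDEVICE=txtwrite', '-o', 'page_%03d.txt', 'input.pdf'])

heading = re.compile(r'^([0-9IVX]+\.)+.*$')
for name in sorted(glob.glob('page_*.txt')):
    page = int(re.search(r'\d+', name).group())   # page number taken from the filename
    with open(name) as fh:
        for line in fh:
            if heading.match(line):
                print '%s / page %d' % (line.strip(), page)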

How to find and replace box character in text file?

I have a large text file that I'm going to be working with programmatically but have run into problems with a special character strewn throughout the file. The file is way too large to scan looking for specific characters. Most of the other unwanted special characters I've been able to get rid of using some regex patterns. But there is a box character, similar to "□". When I tried to copy the character from the actual text file and paste it here I get "�", so the example of the box is from the Windows character map, which includes the code 'U+25A1', which I'm not sure how to interpret or whether it's something I could use for a regex search.
Would anyone know how I could search for the box symbol similar to "□" in a UTF-8 encoded file?
EDIT:
Here is an example from the text file:
"� Prune palms when flower spathes show, or delay pruning until after the palm has finished flowering, to prevent infestation of palm flower caterpillars. Leave the top five rows."
The only problem is that, as mentioned in the original post, the square gets converted into a diamond question mark.
It's unclear where and how you are searching, although you could use the hex equivalent:
\x{25A1}
Example:
https://regex101.com/r/b84oBs/1
The black diamond with a question mark is not a character, per se. It is what a browser spits out at you when you give it unrecognizable bytes.
Find out where that data is coming from.
Determine its encoding. (Usually UTF-8, but might be something else.)
Be sure the browser is configured to display that encoding. Having <meta charset=UTF-8> in the header of the page is likely to suffice.
I found a workaround using Notepad++ and this website. It's still not clear what encoding system the square is originally from, but when I post it into the query field in the website above or into the Notepad++ Conversion Table (Plugins > Converter > Conversion Table) it gives the hex-character code for the "Replacement Character" which is the diamond with the question mark.
Using this code in a regex expression, \x{FFFD}, within Notepad++ search gave me all the squares, although recognizing them as the Replacement Character.
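If the cleanup eventually has to happen in code rather than in Notepad++, a small Python sketch along these lines should also work (the filenames are placeholders; it strips both the white square U+25A1 and the replacement character U+FFFD):
import io
import re

boxes = re.compile(u'[\u25A1\uFFFD]')
with io.open('input.txt', encoding='utf-8') as src, \
        io.open('cleaned.txt', 'w', encoding='utf-8') as dst:
    for line in src:
        dst.write(boxes.sub(u'', line))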

Python 2.7 "latin-1" encoding used instead of "UTF-8"

I am aware that there are plenty of discussions on the "UTF-8" encoding issue on Python 2 but I was unable to find a solution to my problem so far. I am currently creating a script to get the name of a file and hyperlink it in xlwt, so that the file can be accessed by clicks in the spreadsheet. Problem is, some of the names of these files include non-ASCII characters.
Question 1
I used the following line to retrieve the name of the file. There is only one file in the folder by the way.
>>f = filter(os.path.isfile, os.listdir(tmp_path))[0]
And then
>>print f
'521001ldrAvisoAcionistas(Retifica\xe7\xe3o)_doc'
>>print sys.stdout.encoding
'UTF-8'
>>f.decode("UTF-8")
*** UnicodeDecodeError: 'utf8' codec can't decode byte 0xe7 in position 76: invalid continuation byte
From browsing the discussions here, I realized that "\xe7\xe3o" is not a "UTF-8" encoding. Running the following line seems to back this point.
>>f.decode("latin-1")
u'521001ldrAvisoAcionistas(Retifica\xe7\xe3o)_doc'
My question is then, why is the variable f being encoded in "latin-1" when the system encoding is set to "UTF-8"?
Question 2
While f.decode("latin-1") gives me the output that I want, I am still unable to supply the variable to the hyperlink function in the spreadsheet.
>>data.append(["File", xlwt.Formula('HYPERLINK("%s";"%s")' % (os.path.join(dl_path,f.decode("latin-1")),f.decode("latin-1")))])
*** FormulaParseException: can't parse formula HYPERLINK("u'H:\\Mad Lab\\SE Doc Crawler\\bovespa\\download\\521001ldrAvisoAcionistas(Retifica\xe7\xe3o)_doc's;"u'521001ldrAvisoAcionistas(Retifica\xe7\xe3o)_doc's)
Apparently, the closing double quote got eaten up and was replaced by a " 's" suffix. Can somebody help to explain what's going on here? 0.0
Oh and if someone can suggest a solution to Question 2 above then I will be very grateful - for you would have saved my weekend from misery!
Thanks in advance all!
Welcome to the confusing world of encoding! There's at least file encoding, terminal encoding and filename encoding to deal with, and all three could be different.
In Python 2.x, the goal is to get a Unicode string (different from str) from an encoded str. The problem is that you don't always know the encoding used for the str so it's difficult to decode it.
When using listdir() to get filenames, there's a documented but often overlooked quirk - if you pass a str to listdir() you get encoded strs back. These will be encoded according to your locale. On Windows this will be an 8-bit character set, like windows-1252.
Alternatively, pass listdir() a Unicode string instead.
E.g.
os.listdir(u'C:\\mydir')
Note the u prefix
This will return properly decoded Unicode filenames. On Windows and OS X, this is pretty reliable as long as your environment locale hasn't been messed with.
In your case, listdir() would return:
u'521001ldrAvisoAcionistas(Retifica\xe7\xe3o)_doc'
Again, note the u prefix. You can now print this to your PyCharm console with no modification.
E.g.
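# Note: tmp_path must itself be a unicode string here (e.g. u'C:\\mydir') so that listdir() returns unicode names.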
f = filter(os.path.isfile, os.listdir(tmp_path))[0]
print f
As for Question 2, I did not investigate further but just printed the output as unicode strings, rather than xlwt objects, due to time constraints. I'm able to continue with the project, though without an understanding of what went wrong here. In that sense, the two questions above have been answered.

Pygments syntax highlighter in python tkinter text widget

I have been trying to incorporate syntax highlighting with the tkinter text widget. However, using the code found on this post, I cannot get it to work. There are no errors, but the text is not highlighted and a line is skipped after each character. If there is a better way to incorporate syntax highlighting with the tkinter text widget, I would be happy to hear it. Here is the smallest code I could find that replicates the issue:
import Tkinter
import ScrolledText
from pygments import lex
from pygments.lexers import PythonLexer

root = Tkinter.Tk(className=" How do I put an end to this behavior?")
textPad = ScrolledText.ScrolledText(root, width=100, height=80)
textPad.tag_configure("Token.Comment", foreground="#b21111")
code = textPad.get("1.0", "end-1c")

# Parse the code and insert into the widget
def syn(event=None):
    for token, content in lex(code, PythonLexer()):
        textPad.insert("end", content, str(token))

textPad.pack()
root.bind("<Key>", syn)
root.mainloop()
So far, I have not found a solution to this problem (otherwise I would not be posting here). Any help regarding syntax highlighting a tkinter text widget would be appreciated.
Note: This is on python 2.7 with Windows 7.
The code in the question you linked to was designed more for highlighting already existing text, whereas it looks like you're trying to highlight it as you type.
I can give some suggestions to get you started, though I've never done this and don't know what the most efficient solution is. The solution in this answer is only a starting point, there's no guarantee it is actually suited to your problem.
The short synopis is this: don't set up a binding that inserts anything. Instead, just highlight what was inserted by the default bindings.
To do this, the first step is to bind on <KeyRelease> rather than <Key>. The difference is that <KeyRelease> will happen after a character has been inserted whereas <Key> happens before a character is inserted.
Second, you need to get tokens from the lexer, and apply tags to the text for each token. To do that you need to keep track where in the document the lexer is, and then use the length of the token to determine the end of the token.
In the following solution I create a mark ("range_start") to designate the current location in the file where the pygments lexer is, and then compute the mark "range_end" based on the start and the length of the token returned by pygments. I don't know how robust this is in the face of multi-byte characters. For now, let's assume single-byte characters.
def syn(event=None):
    textPad.mark_set("range_start", "1.0")
    data = textPad.get("1.0", "end-1c")
    for token, content in lex(data, PythonLexer()):
        textPad.mark_set("range_end", "range_start + %dc" % len(content))
        textPad.tag_add(str(token), "range_start", "range_end")
        textPad.mark_set("range_start", "range_end")
This is crazy inefficient since it re-applies the highlighting to the whole document on every keypress. There are ways to minimize that, such as only highlighting after each word, or when the GUI goes idle, or some other sort of trigger.
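For completeness, here is a minimal runnable sketch (my own assembly of the pieces above, not taken from the linked post): it binds to <KeyRelease>, clears the old Token tags, and re-lexes the whole buffer on each keystroke.
import Tkinter
import ScrolledText
from pygments import lex
from pygments.lexers import PythonLexer

root = Tkinter.Tk()
textPad = ScrolledText.ScrolledText(root, width=100, height=40)
textPad.tag_configure("Token.Comment", foreground="#b21111")
textPad.pack()

def syn(event=None):
    # Remove previously applied Pygments tags so stale highlighting
    # does not accumulate, then re-lex the whole buffer.
    for tag in textPad.tag_names():
        if tag.startswith("Token"):
            textPad.tag_remove(tag, "1.0", "end")
    textPad.mark_set("range_start", "1.0")
    data = textPad.get("1.0", "end-1c")
    for token, content in lex(data, PythonLexer()):
        textPad.mark_set("range_end", "range_start + %dc" % len(content))
        textPad.tag_add(str(token), "range_start", "range_end")
        textPad.mark_set("range_start", "range_end")

# <KeyRelease> fires after the character has been inserted by the
# default bindings, so the lexer sees the up-to-date buffer.
root.bind("<KeyRelease>", syn)
root.mainloop()
Only Token.Comment has a colour configured here, as in the question; in practice you would call tag_configure once per token type you want coloured.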
To highlight certain words you can do this:
textarea.tag_remove("tagname", "1.0", tkinter.END)
first = "1.0"
while True:
    first = textarea.search("word_you_are_looking_for", first, nocase=False, stopindex=tkinter.END)
    if not first:
        break
    last = first + "+" + str(len("word_you_are_looking_for")) + "c"
    textarea.tag_add("tagname", first, last)
    first = last
textarea.tag_config("tagname", foreground="#00FF00")

Unicode, UTF-8, UTF-16 and UTF-32 questions [closed]

I have read a lot about Unicode, ASCII, code pages, all the history, the invention of UTF-8, UTF-16 (UCS-2), UTF-32 (UCS-4), who uses them and so on, but I still have some questions that I have tried hard to find answers to, without success, and I hope you can help me.
1 - Unicode is a standard for encoding characters, and it specifies a code point for each character, something like U+0000 (for example). Imagine I have a file that contains those code points (\u0000); at which point in my application am I going to use them?
This might be a silly question, but I really don't know at which point in my application I would use them.
I'm creating an application that can read a file containing those code points using the \u escape, and I know that I can read it and decode it, but now for the next question.
2 - To which character set (code page) do I need to convert it? I saw some C++ libraries that use the names utf8_to_unicode or utf8-to-utf16 and also just utf8_decode, and this is what confuses me.
I don't know whether answers like this will appear, but some might say: you need to convert it into the code page that you are going to use; but what if my application needs to be internationalized?
3 - I was wondering: in C++, if I try to display non-ASCII characters on the terminal I get some confusing output. The question is: is it the fonts that determine how the characters are displayed?
#include <iostream>

int main()
{
    std::cout << "ö" << std::endl;
    return 0;
}
The output (Windows):
├Â
4 - In which part of that process does the encoding enter? Does it encode, take the code point and try to find the matching character in the font?
5 - WebKit is an engine for rendering web pages in web browsers. If you specify the charset as UTF-8 it works nicely with all characters, but if I specify another charset it doesn't, no matter which font I'm using. What happens?
<html>
  <head>
    <meta charset="iso-8859-1">
  </head>
  <body>
    <p>ö</p>
  </body>
</html>
The output:
ö
Works using:
<meta charset="utf-8">
6 - Imagine now that I read the file, I encode it, I have all the code points, and I need to save the file again. Do I need to save it encoded (\u0000), or do I need to decode it first to transform it back into characters and then save?
7 - Why is the word "unicode" a bit overloaded and sometimes understood to mean UTF-16? (source)
That's all for now. Thanks in advance.
I'm creating an application that can read a file containing those code points using the \u escape, and I know that I can read it and decode it, but now for the next question.
If you're writing a program that processes some kind of custom escapes, such as \uXXXX, it's entirely up to you when to convert these escapes into Unicode code points.
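In Python 2, for instance, the built-in unicode_escape codec handles exactly this \uXXXX form, in case you would rather not parse the escapes yourself:
>>> '\\u00f6 and some plain text'.decode('unicode_escape')
u'\xf6 and some plain text'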
To which character set (code page) do I need to convert it?
That depends on what you want to do. If you're using some other library that requires a specific code page then it's up to you to convert data from one encoding into the encoding required by that library. If you don't have any hard requirements imposed by such third party libraries then there may be no reason to do any conversion.
I was wondering: in C++, if I try to display non-ASCII characters on the terminal I get some confusing output.
This is because various layers of the technology stack use different encodings. From the sample output you give, "├Â" I can see that what's happening is that your compiler is encoding the string literal as UTF-8, but the console is using Windows codepage 850. Normally when there are encoding problems with the console you can fix them by setting the console output codepage to the correct value, unfortunately passing UTF-8 through std::cout currently has some unique problems. Using printf instead worked for me in VS2012:
#include <cstdio>
#include <Windows.h>

int main() {
    SetConsoleOutputCP(CP_UTF8);
    std::printf("%s\n", "ö");
}
Hopefully Microsoft fixes the C++ libraries if they haven't already done so in VS 14.
In which part of that process does the encoding enter? Does it encode, take the code point and try to find the matching character in the font?
Bytes of data are meaningless unless you know the encoding. So the encoding matters in all parts of the process.
I don't understand the second question here.
If you specify the charset as UTF-8 it works nicely with all characters, but if I specify another charset it doesn't, no matter which font I'm using. What happens?
What's going on here is that when you write charset="iso-8859-1" you also have to actually convert the document to that encoding. You're not doing that and instead you're leaving the document as UTF-8 encoded.
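A minimal sketch of that conversion in Python (the filenames are placeholders, and it assumes every character in the page actually has an ISO-8859-1 equivalent):
import io

# Read the document as UTF-8 and write it back out as ISO-8859-1 so the
# bytes on disk match the charset declared in the <meta> tag.
with io.open('page_utf8.html', encoding='utf-8') as src:
    text = src.read()
with io.open('page_latin1.html', 'w', encoding='iso-8859-1') as dst:
    dst.write(text)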
As a little exercise, say I have a file that contains the following two bytes:
0xC3 0xB6
Using information on UTF-8 encoding and decoding, what codepoint do the bytes decode to?
Now using this 8859-1 codepage, what do the same bytes decode to?
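If you want to check your answers, Python can perform both decodings directly:
>>> '\xc3\xb6'.decode('utf-8')        # gives U+00F6, i.e. "ö"
u'\xf6'
>>> '\xc3\xb6'.decode('iso-8859-1')   # gives U+00C3 U+00B6, i.e. "Ã¶"
u'\xc3\xb6'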
As another exercise, save two copies of your HTML document, one using charset="iso-8859-1" and one with charset="utf-8". Now use a hex editor and examine the contents of both files.
Imagine now that I read the file, I encode it, I have all the code points, and I need to save the file again. Do I need to save it encoded (\u0000), or do I need to decode it first to transform it back into characters and then save?
This depends on the program that will need to read the file. If the program expects all non-ASCII characters to be escaped like that then you have to save the file that way. But escaping characters with \u is not a normal thing to do. I only see this done in a few places, such as JSON data and C++ source code.
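For instance, Python's json module produces exactly that kind of escape by default:
>>> import json
>>> json.dumps(u'ö')   # ensure_ascii=True is the default
'"\\u00f6"'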
Why the word "unicode" is a bit overloaded and is sometimes understood to mean utf-16?
Largely because Microsoft uses the term this way. They do so for historical reasons: when they added Unicode support they named all their options and settings "Unicode", but the only encoding they supported was UTF-16.