I have special characters in a list and it breaks SikuliX - python-2.7

I am trying to get paths into a list, and everything works fine until I hit special characters like ä or ö. In the strings they are represented as escaped bytes; for example, ä is \xe4. If I run the same Python script in a terminal, all paths print out correctly even though the paths in the list contain these bytes instead of the actual letters.
Here is my code where I extract all the filenames:
from os import listdir
from os.path import isfile, join

def read_files(path):
    """
    Read all files in the folder specified by path
    :param path: Path to the folder whose contents will be read
    :return: List of all files in the folder specified by path
    """
    files = []
    for f in listdir(path):
        if isfile(join(path, f)):
            files.append(make_unicode(join(path, f)))
    return files

def make_unicode(string):
    if type(string) != unicode:
        string = string.decode('utf-8')
    return string
I have no idea where to go from here. I have tried practically everything I could find on Google. This seems to be more of a SikuliX problem than a Python one, because the Python code works just fine outside SikuliX.
I use Python 2.7 and SikuliX 1.1.1.

So I got this sorted out. The problem was that the read_files(path) function was called again later, and when the path was a unicode string with the special characters still encoded as bytes, the whole thing broke. I restructured my code so that this function is called only once, and after that I was able to work with special characters.
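Roughly, the fix looked like this (a sketch; the path literal and process() are placeholders for the rest of my script):

files = read_files(u'/path/to/folder')  # called exactly once; path is hypothetical
for f in files:
    process(f)  # work with the already-decoded unicode paths from here on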

Related

Regex - Filter for atypical filetypes

I have a folder filled with plain text files with filenames formatted as follows:
00001.7c53336b37003a9286aba55d2945844c
00002.9c4069e25e1ef370c078db7ee85ff9ac
00003.860e3c3cee1b42ead714c5c874fe25f7
00002.d94f1b97e48ed3b553b3508d116e6a09
00001.7848dde101aa985090474a91ec93fcf0
After I acquire the filenames as strings, how can I filter them so that all relevant files are accepted and everything else is rejected?
- I could reformat all files in a controlled environment to strip the string up to the dot, then add another dot and a constant filetype.
- I could try to set a fixed acceptable value for the length of the string after the dot.
- I could exclude some specific filetypes and hope nothing else slips through.
All these methods require me to rename the files or verify personally that there is nothing else in the folder.
The files all have a very long extension. You could use the following to select files whose extension is exactly 32 characters:
\.[^.]{32}$
Or something like
\.[^.]{8,}$
which matches files whose extension is at least 8 characters long.
A close look reveals that (at least) in your examples the only alphabetic characters are a, b, ..., f, so you could restrict your search further with:
\.[0-9a-f]{8,}$
Also, in all the examples the file name has exactly 5 digits and starts with (at least) a double 0, which we could incorporate with:
^0{2}\d{3}\.[0-9a-f]{8,}$
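A quick way to apply this in Python (a minimal sketch; the sample file list is mine):

import re

# Accept names like 00001.7c53336b37003a9286aba55d2945844c:
# five digits starting with 00, a dot, then 8 or more hex characters.
pattern = re.compile(r'^0{2}\d{3}\.[0-9a-f]{8,}$')

filenames = [
    '00001.7c53336b37003a9286aba55d2945844c',
    'notes.txt',
    '00002.d94f1b97e48ed3b553b3508d116e6a09',
]
accepted = [name for name in filenames if pattern.match(name)]
# accepted -> ['00001.7c53336b37003a9286aba55d2945844c',
#              '00002.d94f1b97e48ed3b553b3508d116e6a09']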

WindowsError: [Error 3] The system cannot find the path specified (when path too long?)

I'm making a script to find unique files between two directories. To do this, I use os.walk() to walk through the files, and if files of the same size exist, I hash them to ensure they are the same (opening the files in the process). The problem is that some files produce the above-mentioned error when opened. The most common reason people run into this problem is that the path is not correctly joined, causing the script to try to open a file that doesn't exist. That isn't the case for me.
After trying different combinations of directories, I began to notice a pattern whereby files that produce the error seem to have a deep directory structure and a long filename. I can't think of any other reason for the issue - there are no character-encoding errors (I decode all my paths to UTF-8) and the paths do exist by virtue of os.walk().
My walk code:
for root, dirs, files in os.walk(directory):
    for filename in files:
        file_path = os.path.join(root, filename)
My hashing code:
import hashlib

byte_size = 65536  # chunk size for reading; the exact value is an assumption

def hash(file_path):
    with open(file_path, 'rb') as f:
        hasher = hashlib.md5()
        while True:
            buf = f.read(byte_size)
            if buf != '':
                hasher.update(buf)
            else:
                break
    result = hasher.hexdigest()
    return result
Edit: The most recent path where the issue appeared was 5 directories deep (containing 142 characters, accounting for double backslashes), and the filename was an additional 122 characters long.
That's due to Windows API file path size limitation as explained on MSDN:
In the Windows API (with some exceptions discussed in the following paragraphs), the maximum length for a path is MAX_PATH, which is defined as 260 characters. A local path is structured in the following order: drive letter, colon, backslash, name components separated by backslashes, and a terminating null character. For example, the maximum path on drive D is "D:\some 256-character path string<NUL>" where "<NUL>" represents the invisible terminating null character for the current system codepage. (The characters < > are used here for visual clarity and cannot be part of a valid path string.)
As also explained on that page, newer versions of Windows support an extended file path prefix (\\?\) used for Unicode paths and such, but that's not consistent or guaranteed behavior, i.e. it doesn't mean it will work in all cases.
Either way, try prepending your path with the extended path prefix and see if it works for your case:
file_path = "\\\\?\\" + os.path.join(root, filename)
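Applied to the walk from the question, that might look like the following sketch (os.path.abspath is my addition, since the extended-length prefix only works with absolute paths, and the unicode literal matters on Python 2 so the wide Windows API is used):

import os

for root, dirs, files in os.walk(directory):
    for filename in files:
        # the \\?\ prefix requires an absolute path and a unicode string
        file_path = u"\\\\?\\" + os.path.abspath(os.path.join(root, filename))
        digest = hash(file_path)  # the hash() helper from the question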

Python application cannot search for strings that contain utf-8 characters

I have created a small tool using Tkinter that takes a string from an Entry widget, searches for that string in multiple files, and displays the list of file names that contain the string in a Listbox. All the files are already UTF-8 encoded.
Now the problem is: when I run my code from the IDE (PyCharm) and input a search string that contains a UTF-8 character in the tool UI, it works fine and finds all files that contain it.
But if I create an exe file of that code (using py2exe), launch the tool, and enter the same string, it cannot find it and the code continues to search non-stop. (With non-UTF-8 characters, it works fine.)
In the application code, I have imported codecs and open the files with
codecs.open(SourceFile, encoding='utf-8')
Please help me solve this problem, so that the exe file also works correctly and searches strings successfully.
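For reference, one way to keep the comparison consistent regardless of how the script is launched is to make sure both the file contents and the search string are unicode before comparing. A minimal sketch (file_contains and its arguments are hypothetical names, and the UTF-8 assumption for the incoming bytes may need adjusting):

import codecs

def file_contains(source_file, search_string):
    # Ensure the needle is unicode; if it arrives as a byte string,
    # decode it first (assuming UTF-8 here).
    if isinstance(search_string, str):
        search_string = search_string.decode('utf-8')
    with codecs.open(source_file, encoding='utf-8') as f:
        return search_string in f.read()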

filename sequence extraction python

I have to identify and isolate a number sequence from the file names in a folder of files, and optionally identify non-continuous sequences. The files are .dpx files. There is almost no file naming structure, except that somewhere in the filename is a sequence number and an extension of '.dpx'. There is a wonderful module called PySeq that can do all of the hard work, except it just bombs with a directory of thousands, and sometimes hundreds of thousands, of files: "Argument list too large". Has anyone had experience working with sequence-number isolation and dpx files in particular? Each file can be up to 100MB in size. I am working on a CentOS box using Python 2.7.
File names might be something like:
test00_take1_00001.dpx
test00_take1_00002.dpx
another_take_ver1-0001_3.dpx
another_take_ver1-0002_3.dpx
(Two continuous sequences)
This should do exactly what you're looking for. It builds a dict of dicts, keyed first by the text before the sequence number and then by the text after it, collecting each full filename in a list; filenames that share both a prefix and a suffix therefore end up in the same sequence.
It then joins all of those lists into a single list (you might as well skip this part and turn it into a generator of lists for better memory efficiency).
import re
from collections import defaultdict

input_list = [
    "test00_take1_00001.dpx",
    "test00_take1_00002.dpx",
    "another_take_ver1-0001_3.dpx",
    "another_take_ver1-0002_3.dpx"]

results_dict = defaultdict(lambda: defaultdict(list))
matches = (re.match(r"(.*?[\W_])\d+([\W_].*)", item) for item in input_list)
for match in matches:
    if match:  # skip names that contain no sequence number at all
        results_dict[match.group(1)][match.group(2)].append(match.group(0))

results_list = [d2 for d1 in results_dict.values() for d2 in d1.values()]
>>> results_list
[['another_take_ver1-0001_3.dpx', 'another_take_ver1-0002_3.dpx'], ['test00_take1_00001.dpx', 'test00_take1_00002.dpx']]

Python 2.7: Handling Unicode Objects

I have an application that needs to handle non-ASCII characters of unknown encoding. The program may delete or replace these characters (if they are found in a user dictionary file); otherwise they need to pass through cleanly, unaltered. What's mind-boggling is that it works one minute, then I make some seemingly trivial change, and now it fails with UnicodeDecodeError, UnicodeEncodeError, or kindred errors. Addressing this has led me down the road of cargo-cult programming: making random tweaks that get it working again, but I have no idea why. Is there a general-purpose solution for dealing with this, perhaps even the creation of a class that modifies the normal way Python deals with strings?
I'm not sure what code to include as about five separate modules are involved. Here is what I am doing in abstract terms:
- Text is taken from one of two sources: text that the user has pasted directly into a Tkinter toplevel window, or text captured from the Win32 clipboard via a hotkey command.
- The text is processed, including the removal of whitespace characters; then certain characters/words are replaced or simply deleted based on a customizable user dictionary.
- The result is then returned to the Tkinter GUI or the Win32 clipboard, depending on whether or not the keyboard shortcut was used.
Some details that may be relevant:
All modules use
# -*- coding: utf-8 -*-
The user dictionary is saved in UTF-16 LE with BOM (a function removes BOM characters when parsing the file). The file object is instantiated with
self.pf = codecs.open(self.pattern_fn, 'r', 'utf-16')
The entry points for text are a Tkinter GUI Text widget:
text = self.paste_to_field.get(1.0, Tkinter.END)
Or from the clipboard:
text = win32clipboard.GetClipboardData(win32clipboard.CF_UNICODETEXT)
And an example error:
File "C:\Python27\lib\encodings\cp437.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u201d' in position 2: character maps to <undefined>
Furthermore, the same text might work when tested on OS X (where I do development work) but cause an error on Windows.
Regular expressions are used; however, in this case no non-ASCII characters are included in the patterns. For non-ASCII characters I simply use
text = text.replace(old, new)
Another thing to consider: for c in text style iteration is no good, because in a byte string a single non-ASCII character may look like several characters to Python; the normal word/character distinction no longer holds. Also, using bad_letter = repr(non_ASCII) doesn't help, since str(bad_letter) merely returns a string of the escape sequence; it can't restore the original character.
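To see why byte-wise iteration breaks (a minimal Python 2 illustration; the string literal is my own example):

# -*- coding: utf-8 -*-
s = 'ä'                  # a UTF-8 byte string: two bytes, '\xc3\xa4'
print len(s)             # 2 -- iterating visits two meaningless bytes
u = s.decode('utf-8')    # decode once at the boundary, work in unicode
print len(u)             # 1 -- one real character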
Sorry if this is extremely vague. Please let me know what info I can provide to help clarify. Thanks in advance for reading this.