How to parse special characters in python's ArgumentParser()? - python-2.7

I am working with an application that receives parameters like a ":" separated string as node ID and a flow name that may contain several special characters. When I want to parse the arguments an error is triggered due to some issues with special characters like *. This is an example input:
python flowapp.py --remove 00:00:02:84:75:e2:95:42 UDP*node-3_to_node-4*dp9000__#node-1
Here is the code I am using to parse option "--remove" :
parser.add_argument("-r","--remove",help="remove the specified flow\
entry from a given node",nargs='2')
When I execute the app I get the following errors:
...
start_index = consume_optional(start_index)
File "/usr/lib/python2.7/argparse.py", line 1858, in consume_optional
arg_count = match_argument(action, selected_patterns)
File "/usr/lib/python2.7/argparse.py", line 2011, in _match_argument
nargs_pattern = self._get_nargs_pattern(action)
File "/usr/lib/python2.7/argparse.py", line 2176, in _get_nargs_pattern
nargs_pattern = '(-*%s-*)' % '-*'.join('A' * nargs)
TypeError: can't multiply sequence by non-int of type 'str'
Is there a way to tell python's argparser to interpret characters like -, * or # as special characters and not "math" operators?

The problem isn't with the special characters. The sys.argv will be something like:
['flowapp.py', '--remove', '00:00:02:84:75:e2:95:42', 'UDP*node-3_to_node-4*dp9000__#node-1']
which argparse should have no problems handling.
The problem is with the string argument to nargs:
parser.add_argument("-r","--remove",...,nargs='2')
As part of parsing it constructs an argument matching pattern, which depends on the nargs value:
nargs_pattern = '(-*%s-*)' % '-*'.join('A' * nargs)
TypeError: can't multiply sequence by non-int of type 'str'
Unless nargs is one of the special values like '*', '+' or '?', it must be an integer. You gave it the string '2'. Pass the integer instead:
parser.add_argument("-r","--remove",...,nargs=2)
In other versions of argparse, you might get a different error:
ValueError: length of metavar tuple does not match nargs
But it's the same issue.
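As a quick check, here is a minimal sketch of the corrected option parsing the example input. The metavar names and the description text are illustrative assumptions, not part of the original app; the argument list is passed directly to parse_args so no shell is involved.

```python
import argparse

# Hypothetical stand-in for the parser in flowapp.py.
parser = argparse.ArgumentParser(description="flow manager (illustrative)")
parser.add_argument(
    "-r", "--remove",
    nargs=2,  # an integer, not the string '2'
    metavar=("NODE_ID", "FLOW_NAME"),  # assumed names, only used in --help output
    help="remove the specified flow entry from a given node",
)

# The same values the shell would place in sys.argv[1:].
argv = ["--remove", "00:00:02:84:75:e2:95:42", "UDP*node-3_to_node-4*dp9000__#node-1"]
args = parser.parse_args(argv)

node_id, flow_name = args.remove
print("node: %s" % node_id)
print("flow: %s" % flow_name)
```

On a real command line it is still worth quoting the flow name, because an unquoted * can be expanded by the shell before Python ever sees it.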

Related

How to remove prefixed u from a unicode string?

I am reading the lines from a CSV file and applying the LDA algorithm to find the most common topic. After the data processing step, every word in doc_processed has a 'u' in front of it. Why does this happen, and how can I remove the 'u' from doc_processed? My code in Python 2.7 is
import re
import string
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
data = [line.strip() for line in open("/home/dnandini/test/h.csv", 'r')]
stop = set(stopwords.words('english'))  # stop words
exclude = set(string.punctuation)  # to remove the punctuation
lemma = WordNetLemmatizer()  # to reduce each word to its lemma (base form)
def clean(doc):
    stop_free = " ".join([i for i in doc.lower().split() if i not in stop])
    punc_free = ''.join(ch for ch in stop_free if ch not in exclude)
    normalized = " ".join(lemma.lemmatize(word) for word in punc_free.split())
    shortword = re.compile(r'\W*\b\w{1,2}\b')  # words of one or two characters
    output = shortword.sub('', normalized)
    return output
doc_processed = [clean(doc) for doc in data]
Output as doc_processed -
[u'amount', u'ze69heneicosadien11one', u'trap', u'containing', u'little', u'microg', u'zz69ket', u'attracted', u'male', u'low', u'population', u'level', u'indicating', u'potent', u'sex', u'attractant', u'trap', u'baited', u'z6ket', u'attracted', u'male', u'windtunnel', u'bioassay', u'least', u'100fold', u'attractive', u'male', u'zz69ket', u'improvement', u'trap', u'catch', u'occurred', u'addition', u'z6ket', u'various', u'binary', u'mixture', u'zz69ket', u'including', u'female', u'ratio', u'ternary', u'mixture', u'zz69ket']
The u'some string' format means it is a unicode string. See this question for more details on unicode strings themselves, but the easiest way to fix this is likely to encode the result with str.encode before returning it from clean.
def clean(doc):
    # all as before until
    output = shortword.sub('', normalized).encode()
    return output
Note that attempting to encode Unicode code points that don't translate directly to the default encoding (which appears to be ASCII; call sys.getdefaultencoding() on your system to check) will raise an error here. You can handle the error in various ways by passing the errors keyword argument to encode.
s.encode(errors="ignore") # omit the code point that fails to encode
s.encode(errors="replace") # replace the failing code point with '?'
s.encode(errors="xmlcharrefreplace") # replace the failing code point with an XML character reference such as '&#233;'
# Note that U+FFFD, the Unicode Replacement Character, is what "replace" inserts when decoding; when encoding, it inserts '?'.
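To make the difference between the error handlers concrete, here is a small sketch. The sample word and the explicit "ascii" target are assumptions chosen for illustration; the original data may contain other characters.

```python
# Hypothetical sample: "cafe" with an accented final letter (U+00E9),
# which has no ASCII equivalent.
word = u"caf\xe9"

# Each handler decides what happens to the code point that cannot be encoded.
ignored = word.encode("ascii", errors="ignore")             # drops it
replaced = word.encode("ascii", errors="replace")           # substitutes '?'
xml_ref = word.encode("ascii", errors="xmlcharrefreplace")  # substitutes '&#233;'

# With the default handler ("strict"), the same call raises instead.
try:
    word.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

print(ignored, replaced, xml_ref, strict_failed)
```

Which handler is appropriate depends on whether losing or altering characters is acceptable for the later LDA step.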

I want to search a file for three strings and print 'defect' only if all of those strings are present

I have a txt file that contains three debug signatures.
x = 'task cocaLc Requested reboot'
y = 'memPartFree'
z = 'memPartAlloc'
import re
f = open('testfile.txt','r')
searchstrings = ('task cocaLc Requested reboot', 'memPartFree', 'memPartAlloc')
for line in f():
    for word in searchstrings:
        if any(s in line for s in searchstrings):
            print 'defect'
I want to create a short script to scan through the file and print 'defect' only if all these three strings are present.
I have tried several different approaches, but none of them meets this requirement.
First, there is a small error on line 4 of the example code (for line in f():). f is a file object, not a callable, so you shouldn't put parentheses after it.
If you have a file with the following in it:
task cocaLc Requested reboot
memPartFree
memPartAlloc
It will print out "defect" 9 times because you're checking once for each line, and once for each search string. So three lines, times three search strings is 9.
The any() function returns True whenever the current line contains at least one of the defined search strings, and the inner loop repeats that same check once per search string. Thus, this code prints "defect" once for each matching line, multiplied by the number of search strings you've defined.
To resolve this, the program needs to keep track of which of the search strings have been detected so far. You might do something like this:
f = open('testfile.txt','r')
searchstrings = ['task cocaLc Requested reboot', 'memPartFree', 'memPartAlloc']
detections = [False, False, False]
for line in f:
    for i in range(0, len(searchstrings)):  # loop through searchstrings using index numbers
        if searchstrings[i] in line:
            detections[i] = True  # no break here: one line may contain several search strings
f.close()
if all(detections):  # if every search string was detected, every value in detections is True
    print "defect"
In this code, we loop through the lines and the search strings, and the detections list tells us which search strings have been found in the file. If all elements in that list are True once the whole file has been read, every search string was detected, so "defect" is printed exactly once.
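The same idea can be written more compactly with a set of signatures that have not been seen yet. This is only a sketch: the function name and the in-memory sample lines are assumptions made so the example runs without testfile.txt; with a real file you could pass the open file object as lines.

```python
def missing_signatures(lines, signatures):
    """Return the set of signatures that appear in none of the lines."""
    remaining = set(signatures)
    for line in lines:
        # Keep only the signatures this line does not contain.
        remaining = {sig for sig in remaining if sig not in line}
        if not remaining:
            break  # everything has been found, no need to read further
    return remaining


signatures = ['task cocaLc Requested reboot', 'memPartFree', 'memPartAlloc']

# Hypothetical file contents; note that one line holds two signatures.
sample_lines = [
    "boot ok\n",
    "task cocaLc Requested reboot\n",
    "memPartFree then memPartAlloc\n",
]

if not missing_signatures(sample_lines, signatures):
    print("defect")
```

Returning the missing signatures rather than a plain boolean also makes it easy to report which signature was absent.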

How to get the line number of an exception in OCaml without debugging symbols?

Is there a good way to get the line number of exception in OCaml without debugging symbols? Certainly, if we turn on debugging symbols and run with OCAMLRUNPARAM=b, we can get the backtrace. However, I don't really need the whole backtrace and I'd like a solution without debugging symbols. At the moment, we can write code like
try
assert false
with x ->
failwith (Printexc.to_string x ^ "\nMore useful message")
in order to get the file and line number from assert, but this seems awkward. Is there a better way to get the file and line number of the exception?
There are global symbols __FILE__ and __LINE__ that you can use anywhere.
$ ocaml
OCaml version 4.02.1
# __FILE__;;
- : string = "//toplevel//"
# __LINE__;;
- : int = 2
#
Update
As @MartinJambon points out, there is also __LOC__, which gives the filename, line number, and character location in one string:
# __LOC__;;
- : string = "File \"//toplevel//\", line 2, characters -9--2"
Update 2
These symbols are defined in the Stdlib module (formerly known as Pervasives). The full list is: __LOC__, __FILE__, __LINE__, __MODULE__, __POS__, __LOC_OF__, __LINE_OF__, __POS_OF__.
The last three return information about a whole expression rather than just a single location in a file:
# __LOC_OF__ (8 * 4);;
- : string * int = ("File \"//toplevel//\", line 2, characters 2-9", 32)

What do the ">>" symbols mean in Python code: map(chr,[x,x>>8,y])

The error code I get in another file that uses it is:
Traceback (most recent call last):
File "C:\Anaconda\lib\site-packages\pyahoolib-0.2-py2.7.egg\yahoo\session.py", line 107, in listener
t.send_pk(consts.SERVICE_AUTHRESP, auth.hash(t.login_id, t.passwd, p[94]))
File "C:\Anaconda\lib\site-packages\pyahoolib-0.2-py2.7.egg\yahoo\auth.py", line 73, in hash
hs = md5.new(mkeystr+"".join(map(chr,[x,x>>8,y]))).digest()
ValueError: chr() arg not in range(256)
UPDATE: @merlin2011: This is confusing me. The code is hs = md5.new(mkeystr+"".join(map(chr,[x,x>>8,y]))).digest()
where chr has a comma after it. I thought it was a function, according to docs.python.org: chr(i)
Return a string of one character whose ASCII code is the integer i. For example, chr(97) returns the string 'a'. This is the inverse of ord(). The argument must be in the range [0..255], inclusive; ValueError will be raised if i is outside that range. See also unichr().
If so, is [x,x>>8,y] an iterable for map() I just don't recognize yet?
Also, I don't want to change any of this code because it is part of the pyahoolib-0.2 auth.py file. But to get it all working I do not know what to do.
It's the Binary Right Shift Operator:
From Python Wiki:
x >> y:
Returns x with the bits shifted to the right by y places. This is the same as integer-dividing (//) x by 2**y.
In case you were wondering, the error message means that chr only accepts arguments in the range 0 to 255 inclusive (that is, range(256)), and your map call is passing it a value outside that range.
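A short sketch of both points, using made-up values for x and y (the real values inside auth.py are not shown in the traceback). The masking step at the end is only one possible way to force a value into byte range, and whether it is correct for this library's hashing is an assumption that would need checking.

```python
x = 0x1234  # hypothetical 16-bit value (4660)
y = 300     # hypothetical value that does not fit in one byte

# Right shift by 8 drops the low 8 bits; same result as x // 2**8.
shifted = x >> 8

# The list handed to map(chr, ...) in the traceback has this shape.
candidates = [x, x >> 8, y]

# Python 2's chr() rejects anything outside 0..255.
out_of_range = [v for v in candidates if not 0 <= v <= 255]

# Keeping only the low 8 bits always yields a value chr() accepts.
masked = [v & 0xFF for v in candidates]

print(shifted, out_of_range, masked)
```

Note that map(chr, [...]) passes the function chr itself as the first argument, which is why it is followed by a comma rather than parentheses.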

URL-Encoding a Byte String?

I am writing a Bittorrent client. One of the steps involved requires that the program sends a HTTP GET request to the tracker containing an SHA1 hash of part of the torrent file. I have used Fiddler2 to intercept the request sent by Azureus to the tracker.
The hash that Azureus sends is URL-Encoded and looks like this: %D9%0C%3C%E3%94%18%F0%C5%D9%83X%E03I%26%2B%60%8C%BFR
The hash should look like this before it's URL-Encoded: d90c3ce39418f0c5d98358e03349262b608cbf52
I notice that it is not as simple as placing a '%' symbol every two characters, so how would I go about encoding this byte string to get the same result as Azureus?
Thanks in advance.
Actually, you can just place a % symbol every two characters. Azureus doesn't do that because, for example, R is a safe character in a URL, and 52 is the hexadecimal representation of R, so it doesn't need to percent-encode it. Using %52 instead is equivalent.
To convert the URL-encoded form back to the hex string, go through the URL-encoded string from left to right. If you encounter a %, output the next two characters, converting upper-case to lower-case. If you encounter anything else, output the ASCII code for that character in hex using lower-case letters.
%D9 %0C %3C %E3 %94 %18 %F0 %C5 %D9 %83 X %E0 3 I %26 %2B %60 %8C %BF R
The ASCII code for X is 0x58, so that becomes 58. The ASCII code for 3 is 0x33.
(I'm kind of puzzled why you had to ask though. Your question clearly shows that you recognized this as URL-Encoded.)
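The left-to-right procedure described above can be sketched as follows. Python is used here purely for illustration, and the function name is an assumption; the sketch also assumes the input is well formed (every % is followed by two hex digits).

```python
def percent_encoded_to_hex(encoded):
    """Turn a URL-encoded byte string into its lower-case hex form."""
    out = []
    i = 0
    while i < len(encoded):
        if encoded[i] == "%":
            # Escaped byte: copy the two hex digits, lower-cased.
            out.append(encoded[i + 1:i + 3].lower())
            i += 3
        else:
            # Unescaped safe character: emit its ASCII code in hex.
            out.append("%02x" % ord(encoded[i]))
            i += 1
    return "".join(out)


esc_str = '%D9%0C%3C%E3%94%18%F0%C5%D9%83X%E03I%26%2B%60%8C%BFR'
print(percent_encoded_to_hex(esc_str))
```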
Even though I know well that the original question was about C++, it can sometimes be useful to see alternative solutions. So, for what it's worth (10 years later), here is
An alternative solution implemented in Python 3.6+
import binascii
import urllib.parse
def hex_str_to_esc_str(s: str, *, encoding: str='Windows-1252') -> str:
    # decode the bytes of the hex string as a string in the given encoding
    decoded_str = binascii.unhexlify(s).decode(encoding)
    # escape string and return
    return urllib.parse.quote(decoded_str, encoding=encoding)
def esc_str_to_hex_str(s: str, *, encoding: str='Windows-1252') -> str:
    # unescape the escaped string as a string in the given encoding
    decoded_str = urllib.parse.unquote(s, encoding=encoding)
    # encode string, hexlify, and return
    return decoded_str.encode(encoding).hex()
Two elementary tests:
esc_str = '%D9%0C%3C%E3%94%18%F0%C5%D9%83X%E03I%26%2B%60%8C%BFR'
hex_str = 'd90c3ce39418f0c5d98358e03349262b608cbf52'
print(hex_str_to_esc_str(hex_str) == esc_str) # True
print(esc_str_to_hex_str(esc_str) == hex_str) # True
Note
Windows-1252 (aka cp1252) emerged as the default encoding as a result of the following test:
import binascii
import chardet
esc_str = '%D9%0C%3C%E3%94%18%F0%C5%D9%83X%E03I%26%2B%60%8C%BFR'
hex_str = 'd90c3ce39418f0c5d98358e03349262b608cbf52'
print(
chardet.detect(
binascii.unhexlify(hex_str)
)
)
...which gave a pretty strong clue:
{'encoding': 'Windows-1252', 'confidence': 0.73, 'language': ''}
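Since a SHA1 digest is raw bytes rather than text, it may be simpler to skip the text encoding altogether: a few byte values have no character in Windows-1252, so the round trip through a string could fail for other hashes. The sketch below works on bytes directly with urllib.parse.quote and urllib.parse.unquote_to_bytes. The function names are assumptions, and safe="" is a choice made here so that every reserved byte is escaped; it is not known whether Azureus follows exactly the same rule.

```python
import urllib.parse


def hex_to_percent_encoded(hex_digest: str) -> str:
    # quote() accepts bytes, so no text encoding has to be guessed.
    raw = bytes.fromhex(hex_digest)
    return urllib.parse.quote(raw, safe="")


def percent_encoded_to_hex(encoded: str) -> str:
    # unquote_to_bytes() returns the raw bytes behind the escapes.
    return urllib.parse.unquote_to_bytes(encoded).hex()


esc_str = '%D9%0C%3C%E3%94%18%F0%C5%D9%83X%E03I%26%2B%60%8C%BFR'
hex_str = 'd90c3ce39418f0c5d98358e03349262b608cbf52'

print(hex_to_percent_encoded(hex_str) == esc_str)
print(percent_encoded_to_hex(esc_str) == hex_str)
```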