Unpacking data with the struct library in Python - python-2.7

When I pack data to a fixed length and then unpack it, I am unable to retrieve the data without specifying its actual length.
How do I retrieve only the data, without the \x00 characters, and without calculating the length beforehand?
>>> import struct
>>> with open("forums_file.dat", "w") as file:
...     file.truncate(1024)
>>> country = 'india'
>>> data = struct.pack('20s', country)
>>> print data
india
>>> data
'india\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
>>> print len(data)
20
>>> unpack_data = struct.unpack('5s', country)
>>> unpack_data
('india',)
In the above code snippet I had to specify the length of the data (5s) while unpacking.

Short answer: You can't do it directly.
Longer answer:
The more indirect solution is actually not that bad. When unpacking the string, you use the same length as you used for packing. That returns the string including the NUL chars (0 bytes).
Then you split on the NUL char and take the first item, like so:
result_with_NUL, = struct.unpack('20s', data)
print(repr(result_with_NUL))
result_string = result_with_NUL.split('\x00', 1)[0]
print(repr(result_string))
The , 1 parameter in split() is not strictly necessary, but makes it more efficient, as it splits only on the first occurrence of NUL instead of every single one.
Also note that when packing and unpacking with the goal of reading/writing files or exchanging data with different systems, it's important to explicitly precede your format strings with "<" or ">" (or "=" in certain very special cases), both for packing and unpacking. Otherwise the structures are aligned and padded in a way that is heavily system dependent and might cause hard-to-find bugs later.
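For instance, here is a small sketch (with a hypothetical two-field layout, not taken from the question above) showing how native mode inserts padding while an explicit byte-order prefix does not:
import struct

# Hypothetical layout: one unsigned byte followed by one unsigned 32-bit int.
print(struct.calcsize('BI'))   # native alignment: typically 8 (3 padding bytes)
print(struct.calcsize('<BI'))  # standard little-endian: always 5, no padding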

Related

How to read a certain number of characters (as opposed to bytes) in Crystal?

In Crystal, if I have a string (or a file), how do I read a certain number of characters at a time? Using functions like IO#read, IO#gets, IO#read_string, and IO#read_utf8, one can specify a certain number of bytes to read, but not a certain number of UTF-8 characters (or ones of another encoding).
In Python, for example, one might do this:
from io import StringIO
s = StringIO("abcdefgh")
while True:
    chunk = s.read(4)
    if not chunk: break
Or, in the case of a file, this:
with open("example.txt", 'r') as f:
while True:
chunk = f.read(4)
if not chunk: break
Generally, I'd expect IO::Memory to be the class to use for the string case, but as far as I can tell, its methods don't allow for this. How would one do this in an efficient and idiomatic fashion (for both strings and files – perhaps the answer is different for each) in Crystal?
There is currently no shortcut implementation for this available in Crystal.
You can read individual chars with IO#read_char or consecutive ones with IO#each_char.
So a basic implementation would be:
io = IO::Memory.new("€abcdefgh")
string = String.build(4) do |builder|
4.times do
builder << io.read_char
end
end
puts string
Whether you use a memory IO or a file or any other IO is irrelevant, the behaviour is all the same.
io = IO::Memory.new("€€€abc€€€") #UTF-8 string from memory
or
io = File.open("test.txt","r") #UTF-8 string from file
iter = io.each_char.each_slice(4) #read max 4 chars at once
iter.each { |slice| #into a slice
puts slice
puts slice.join #join to a string
}
output:
['€', '€', '€', 'a']
€€€a
['b', 'c', '€', '€']
bc€€
['€']
€
In addition to the answers already given, for strings in Crystal you can read a certain number of characters with a range, like this:
my_string = "A foo, a bar."
my_string[0..4] # => "A foo"
This workaround seems to work for me:
io = IO::Memory.new("abcdefghz")
chars_to_read = 2 # Number of chars to read
while true
chunk = io.gets(chars_to_read) # Grab the chunk of type String?
break if chunk.nil? # Break if nothing else to read aka `nil`
end

Converting a string of numbers to hex and back to dec pandas python

I currently have a string of values which I retrieved after filtering through data from a csv file. I ultimately had to do some filtering, but I have the same numbers available as a list, dataframe, or array. I need to take the numbers in the string, convert each one to hex, then take the first 8 digits of that hex value and convert them to decimal, and likewise convert the last 8 digits of the same hex value to decimal, for each element.
I cannot provide a snippet because it is sensitive data, but here is an example.
I basically have something like this
>>> list_A
[52894036, 78893201, 45790373]
If I convert it to a dataframe and call df.dtypes, it says dtype: object and I can convert the values of Column A to bool, int, or string, but the dtype is always an object.
It does not matter whether it is a function or just a simple loop. I have been trying many methods and am unable to attain the results I need. Ultimately the data is taken from different csv files and will never have the same values or list size.
Pandas is designed to work primarily with integers and floats, with no particular facilities for hexadecimal that I know of, but you can use apply to access standard python conversion functions like hex and int:
import pandas as pd

df = pd.DataFrame({'a': [52894036999, 78893201999, 45790373999]})
df['b'] = df['a'].apply(hex)
df['c'] = df['b'].apply(int, base=0)
Results:
             a             b            c
0  52894036999   0xc50baf407  52894036999
1  78893201999  0x125e66ba4f  78893201999
2  45790373999   0xaa951a86f  45790373999
Note that this answer is for Python 3. For Python 2 you may need to strip off the trailing "L" in column "b" with str[:-1].
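If you also need the first and last eight hex digits converted back to decimal (what the question asks for), one possible extension of the same idea, as a sketch assuming the column names above, is:
hex_digits = df['b'].str[2:]                            # drop the "0x" prefix
df['first8'] = hex_digits.str[:8].apply(int, base=16)   # first 8 hex digits -> decimal
df['last8'] = hex_digits.str[-8:].apply(int, base=16)   # last 8 hex digits -> decimal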

How to smooth numbers from a file as many times as wanted in Python 2.7?

I'm trying to create a program that will open a file with a list of numbers in it and then smooth those numbers as many times as the user wants. I have it opening and reading the file, but it will not smooth the numbers; in this form it gives this error: TypeError: unsupported operand type(s) for /: 'str' and 'float'. I also need to figure out how to make it repeat the smoothing the number of times the user asks for. The list of numbers I used in my .txt file is [3, 8, 5, 7, 1].
Here is exactly what I am trying to get it to do:
Ask the user for a filename
Read all floating point data from file into a list
Ask the user how many smoothing passes to make
Display smoothed results with two decimal places
Use functions where appropriate
Algorithm:
Never change the first or last value
Compute new values for all other values by averaging the value with its two neighbors
Here is what I have so far:
filename = raw_input('What is the filename?: ')
inFile = open(filename)
data = inFile.read()
print data
data2 = data[:]
print data2
data2[1]=(data[0]+data[1]+data[2])/3.0
print data2
data2[2]=(data[1]+data[2]+data[3])/3.0
print data2
data2[3]=(data[2]+data[3]+data[4])/3.0
print data2
You almost certainly don't want to be manually indexing the list items. Instead, use a loop:
data2 = data[:]
for i in range(1, len(data)-1):
    data2[i] = sum(data[i-1:i+2])/3.0
data = data2
You can then put that code inside another loop, so that you smooth repeatedly:
smooth_steps = int(raw_input("How many times do you want to smooth the data?"))
for _ in range(smooth_steps):
    # code from above goes here
Note that my code above assumes that you have read numeric values into the data list. However, the code you've shown doesn't do this. You simply use data = inFile.read() which means data is a string. You need to actually parse your file in some way to get a list of numbers.
In your immediate example, where the file contains a Python formatted list literal, you could use eval (or ast.literal_eval if you wanted to be a bit safer). But if this data is going to be used by any other program, you'll probably want a more widely supported format, like CSV, JSON or YAML (all of which have parsers available in Python).
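Putting it together, here is one complete sketch (assuming, as in your example, that the file contains a Python list literal such as [3, 8, 5, 7, 1], parsed with ast.literal_eval; Python 2.7 syntax to match the question):
import ast

def smooth(values):
    # average every interior value with its two neighbours; keep the ends unchanged
    result = values[:]
    for i in range(1, len(values) - 1):
        result[i] = sum(values[i-1:i+2]) / 3.0
    return result

filename = raw_input('What is the filename?: ')
with open(filename) as in_file:
    data = [float(x) for x in ast.literal_eval(in_file.read())]

smooth_steps = int(raw_input('How many smoothing passes? '))
for _ in range(smooth_steps):
    data = smooth(data)

print ', '.join('%.2f' % value for value in data)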

URL-Encoding a Byte String?

I am writing a Bittorrent client. One of the steps involved requires that the program send an HTTP GET request to the tracker containing an SHA1 hash of part of the torrent file. I have used Fiddler2 to intercept the request sent by Azureus to the tracker.
The hash that Azureus sends is URL-Encoded and looks like this: %D9%0C%3C%E3%94%18%F0%C5%D9%83X%E03I%26%2B%60%8C%BFR
The hash should look like this before it's URL-Encoded: d90c3ce39418f0c5d98358e03349262b608cbf52
I notice that it is not as simple as placing a '%' symbol every two characters, so how would I go about encoding this byte string to get the same result as Azureus?
Thanks in advance.
Actually, you can just place a % symbol every two characters. Azureus doesn't do that because, for example, R is a safe character in a URL, and 52 is the hexadecimal representation of R, so it doesn't need to percent-encode it. Using %52 instead is equivalent.
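For example, a quick Python check (a sketch; the hex string is the one from the question) shows that the naive form, with a '%' before every byte, decodes to the same raw bytes:
import binascii
import urllib.parse

hex_str = 'd90c3ce39418f0c5d98358e03349262b608cbf52'
raw = binascii.unhexlify(hex_str)                     # the 20 raw SHA1 bytes
naive = ''.join('%{:02X}'.format(b) for b in raw)     # '%' before every byte
print(urllib.parse.unquote_to_bytes(naive) == raw)    # True: same bytes either way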
Go through the string from left to right. If you encounter a %, output the next two characters, converting upper-case to lower-case. If you encounter anything else, output the ASCII code for that character in hex using lower-case letters.
%D9 %0C %3C %E3 %94 %18 %F0 %C5 %D9 %83 X %E0 3 I %26 %2B %60 %8C %BF R
The ASCII code for X is 0x58, so that becomes 58. The ASCII code for 3 is 0x33.
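A minimal Python sketch of that left-to-right procedure (the function name is just for illustration):
def escaped_to_hex(escaped):
    out = []
    i = 0
    while i < len(escaped):
        if escaped[i] == '%':
            out.append(escaped[i+1:i+3].lower())          # copy the two hex digits
            i += 3
        else:
            out.append('{:02x}'.format(ord(escaped[i])))  # ASCII code of the literal char
            i += 1
    return ''.join(out)

print(escaped_to_hex('%D9%0C%3C%E3%94%18%F0%C5%D9%83X%E03I%26%2B%60%8C%BFR'))
# d90c3ce39418f0c5d98358e03349262b608cbf52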
(I'm kind of puzzled why you had to ask though. Your question clearly shows that you recognized this as URL-Encoded.)
Even though I know well that the original question was about C++, it can sometimes be useful to see alternative solutions. Therefore, for what it's worth (10 years later), here's
An alternative solution implemented in Python 3.6+
import binascii
import urllib.parse

def hex_str_to_esc_str(hex_str: str, *, encoding: str = 'Windows-1252') -> str:
    # decode hex string as a Windows-1252 string
    win1252_str = binascii.unhexlify(hex_str).decode(encoding)
    # escape string and return
    return urllib.parse.quote(win1252_str, encoding=encoding)

def esc_str_to_hex_str(esc_str: str, *, encoding: str = 'Windows-1252') -> str:
    # unescape the escaped string as a Windows-1252 string
    win1252_str = urllib.parse.unquote(esc_str, encoding=encoding)
    # encode string, hexlify, and return
    return win1252_str.encode(encoding).hex()
Two elementary tests:
esc_str = '%D9%0C%3C%E3%94%18%F0%C5%D9%83X%E03I%26%2B%60%8C%BFR'
hex_str = 'd90c3ce39418f0c5d98358e03349262b608cbf52'
print(hex_str_to_esc_str(hex_str) == esc_str) # True
print(esc_str_to_hex_str(esc_str) == hex_str) # True
Note
Windows-1252 (aka cp1252) emerged as the default encoding as a result of the following test:
import binascii
import chardet
esc_str = '%D9%0C%3C%E3%94%18%F0%C5%D9%83X%E03I%26%2B%60%8C%BFR'
hex_str = 'd90c3ce39418f0c5d98358e03349262b608cbf52'
print(
    chardet.detect(
        binascii.unhexlify(hex_str)
    )
)
...which gave a pretty strong clue:
{'encoding': 'Windows-1252', 'confidence': 0.73, 'language': ''}

Regex for template tag with attributes

I haven't found my answer after reading through all of these posts, so I'm hoping one of you heavy hitter regex folks can help me out. I'm trying to isolate the tag name and any attributes from the following string format:
{TAG:TYPE attr1="foo" attr2="bar" attr3="zing" attr4="zang" attr5="zoom" ...}
NOTE: in the above example, TAG will always be the same and TYPE will be one of several preset strings (e.g. share, print, display etc...). TAG and TYPE are uppercased only for the example but will not be case sensitive in practice.
For the moment, let's assume that your attribute names and values, as well as your TAG and TYPE, are strictly alphanumeric. Parsing gets messier (and may not even be regular) if you could have " or = inside those strings.
With those caveats, here's a python regex that gets the job done:
>>> import re
>>> s = '{TAG:TYPE attr1="foo" attr2="bar" attr3="zing" attr4="zang" attr5="zoom"}'
>>> parse_regex = r'\{(?P<tag>\w+):(?P<type>\w+)(?P<attrs>(\s+\w+=\"\w+\")*)\}'
>>> m = re.match(parse_regex, s)
>>> m.group('tag')
'TAG'
>>> m.group('type')
'TYPE'
>>> m.group('attrs')
' attr1="foo" attr2="bar" attr3="zing" attr4="zang" attr5="zoom"'
At this point, you'd want to clean up the attributes into a friendly data structure. Since there could be arbitrarily many of them, it's going to be more convenient (and just as efficient) not to use regexps for this stage.
>>> [attr_str.split('=') for attr_str in m.group('attrs').split()]
[['attr1', '"foo"'], ['attr2', '"bar"'], ['attr3', '"zing"'], ['attr4', '"zang"'], ['attr5', '"zoom"']]
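One way to finish the job (a sketch continuing from the match object m above) is to strip the quotes and build a plain dict of attribute name to value:
>>> attrs = dict((k, v.strip('"')) for k, v in
...              (pair.split('=', 1) for pair in m.group('attrs').split()))
>>> attrs['attr3']
'zing'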