Is it possible to keep specific values grouped using python struct.unpack? - python-2.7

I created a parser for some complex binary files using numpy.fromfile and defining the various dtypes necessary for reading each portion of the binary file. The resulting numpy array was then placed into a pandas dataframe and the same dtype that was defined for converting the binary files into the numpy array was recycled to define the column names for the pandas dataframe.
I was hoping to replicate this process using python struct but ran into an issue. If part of my structure requires a value to be a group of 3 ints, I can define the dtype as numpy.dtype([('NameOfField', '>i4', 3)]) and the returned value from the binary file is [int, int, int]. Can this be replicated using struct, or do I need to regroup the values in the returned tuple based on the dtype before ingesting it into my pandas dataframe? I have read the python struct documentation and have not noticed any examples of this.
Using a struct format of >3i returns a result of int, int, int instead of [int, int, int] like I need.
Edit ...
Below is a generic example. This method using numpy.fromfile works perfectly but is slow when working on my huge binary files, so I am trying to implement it using struct.
import numpy as np
import pandas as pd

def example_structure():
    dt = np.dtype([
        ('ExampleFieldName', '>i4', 3)
    ])
    return dt

# filename of binary file
file_name = 'example_binary_file'
# define the dtype for this chunk of binary data
d_type = example_structure()
# define initial index for the file in memory
start_ind = 0
end_ind = 0
# read in the entire file generically
x = np.fromfile(file_name, dtype='u1')
# based on the dtype find the chunk size
chunk_size = d_type.itemsize
# define the start and end index based on the chunk size
start_ind = end_ind
end_ind = chunk_size + start_ind
# extract just the first chunk
temp = x[start_ind:end_ind]
# cast the chunk as the defined dtype
temp.dtype = d_type
# store the chunk in its own pandas dataframe
example_df = pd.DataFrame(temp.tolist(), columns=temp.dtype.names)
This will return a temp[0] value of [int, int, int] that will then be read into the pandas dataframe as a single entry under the column ExampleFieldName. If I attempt to replicate this using struct, the temp[0] value is int, int, int, which is not read properly into pandas. Is there a way to make struct group values like I can do using numpy?

I'd suggest just splitting it up into a list of objects after unpacking it. It won't be as fast as numpy for huge objects, but then that's why numpy exists :P. Assuming data holds your raw bytes that you want to split into groups of 5 uint32_ts (obviously the data must also be in this shape):
import struct

# unpack everything in one call, then regroup the flat tuple into fives
flat = struct.unpack("5I" * (len(data) // struct.calcsize("5I")), data)
output = [flat[i:i+5] for i in range(0, len(flat), 5)]
Of course, this means iterating the data twice, and since struct.unpack doesn't yield successive values (afaik), doing it in one line won't help that. It'd maybe be faster to iterate over the data directly - I haven't run any tests - like this:
import struct

output, itemsize = [], struct.calcsize("5I")
for i in range(0, len(data), itemsize):
    output.append(struct.unpack("5I", data[i:i+itemsize]))
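Either way, output is now a list of 5-tuples, so feeding it to pandas the way the question does with numpy is straightforward; a minimal sketch, reusing the column name from the question:

import pandas as pd

# each 5-tuple lands in a single cell under the question's column name
example_df = pd.DataFrame({'ExampleFieldName': output})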

Related

Convert SList to Dataframe

I am reading data from a binary .out file using the python module "SWMMToolbox." The command to read the infiltration time series for RG1 from file.out is as follows:
x = !swmmtoolbox extract 'file.out' subcatchment,RG1,Infiltration_loss
See link for details about swmmtoolbox.
The data type of 'x' is a 'IPython.utils.text.SList'
The data looks like this: each element is a datetime string, a comma, and a value. I would like to import this SList into pandas, but am having trouble. I want to get the datetime string as one column and the value after the comma as another. However, when I use
df = pd.DataFrame(data=x)
I get the following:
I also tried to use
df = pd.DataFrame.from_records(x)
but get this:
I tried to use pd.read_csv, but I couldn't get it to work since 'x' is a variable and not a file.
Any suggestions are much appreciated.
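One possible approach, assuming each element of x is a 'datetime,value' line as described above, is to join the SList into one block of text and hand it to pd.read_csv via an in-memory buffer; the column names below are placeholders:

import pandas as pd
from io import StringIO

# join the SList lines into a single unicode blob and let read_csv split on the comma
raw = u'\n'.join(x)
df = pd.read_csv(StringIO(raw), header=None, names=['datetime', 'value'])
df['datetime'] = pd.to_datetime(df['datetime'])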

'DataFlowAnalysis' object has no attribute 'op_MAKE_FUNCTION' in Numba

I haven't seen this specific scenario in my research for this error in Numba. This is my first time using the package so it might be something obvious.
I have a function that calculates engineered features in a data set by adding, multiplying and/or dividing each column in a dataframe called data, and I wanted to test whether numba would speed it up:
from itertools import combinations

import numpy as np
import pandas as pd
from numba import jit

@jit
def engineer_features(engineer_type, features, joined):
    # choose which features to engineer (must be > 1)
    engineered = features
    if len(engineered) > 1:
        if 'Square' in engineer_type:
            sq = data[features].apply(np.square)
            sq.columns = map(lambda s: s + '_^2', features)
        for c1, c2 in combinations(engineered, 2):
            if 'Add' in engineer_type:
                data['{0}+{1}'.format(c1, c2)] = data[c1] + data[c2]
            if 'Multiply' in engineer_type:
                data['{0}*{1}'.format(c1, c2)] = data[c1] * data[c2]
            if 'Divide' in engineer_type:
                data['{0}/{1}'.format(c1, c2)] = data[c1] / data[c2]
        if 'Square' in engineer_type and len(sq) > 0:
            data = pd.merge(data, sq, left_index=True, right_index=True)
    return data
When I call it with lists of features, engineer_type and the dataset:
engineer_type = ['Square','Add','Multiply','Divide']
df = engineer_features(engineer_type,features,joined)
I get the error: Failed at object (analyzing bytecode)
'DataFlowAnalysis' object has no attribute 'op_MAKE_FUNCTION'
Same question here. I think the problem might be the lambda function since numba does not support function creation.
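If the lambda really is the trigger, one hedged workaround (not from the original thread) is to build the renamed columns with a plain loop instead of creating a function object:

# build the squared-column names without a lambda, since numba's bytecode
# analysis fails on function creation
new_names = []
for s in features:
    new_names.append(s + '_^2')
sq.columns = new_names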
I had this same error. Numba doesn't support pandas. I converted the important columns from my pandas df into a bunch of numpy arrays and it worked successfully under @jit.
Also, arrays are much faster than a pandas df, in case you need it for processing large data.
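A rough sketch of that approach (the column names here are placeholders, not from the question): pull the relevant columns out as numpy arrays, do the math in a jitted function, and write the results back.

import numpy as np
from numba import jit

# placeholder column names; in practice these would come from `features`
a = data['col_a'].values
b = data['col_b'].values

@jit(nopython=True)
def add_and_multiply(x, y):
    # numba compiles plain array arithmetic without touching pandas objects
    return x + y, x * y

added, multiplied = add_and_multiply(a, b)
data['col_a+col_b'] = added
data['col_a*col_b'] = multiplied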

converting python pandas column to numpy array in place

I have a csv file in which one of the columns is a semicolon-delimited list of floating point numbers of variable length. For example:
Index List
0 900.0;300.0;899.2
1 123.4;887.3;900.1;985.3
When I read this into a pandas DataFrame, the datatype for that column is object. I want to convert it, ideally in place, to a numpy array (or just a regular float array, it doesn't matter too much at this stage).
I wrote a little function which takes a single one of those list elements and converts it to a numpy array:
import numpy as np

def parse_list(data):
    data_list = data.split(';')
    return np.array(map(float, data_list))
This works fine, but what I want to do is do this conversion directly in the DataFrame so that I can use pandasql and the like to manipulate the whole data set after the conversion. Can someone point me in the right direction?
EDIT: I seem to have asked the question poorly. I would like to convert the following data frame:
Index List
0 900.0;300.0;899.2
1 123.4;887.3;900.1;985.3
where the dtype of List is 'object'
to the following dataframe:
Index List
0 [900.0, 300.0, 899.2]
1 [123.4, 887.3, 900.1, 985.3]
where the datatype of List is numpy array of floats
EDIT2: some progress, thanks to the first answer. I now have the line:
df['List'] = df['List'].str.split(';')
which splits the column in place into lists of strings, but the dtype remains object. When I then try to do
df['List'] = df['List'].astype(float)
I get the error:
return arr.astype(dtype)
ValueError: setting an array element with a sequence.
If I understand you correctly, you want to transform your data from pandas to numpy arrays.
I used this:
pandas_DataName.as_matrix(columns=None)
And it worked for me.
For more information visit here
I hope this could help you.
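Alternatively, to get the per-row arrays the question asks for rather than one big matrix, a minimal sketch is to apply the question's own parsing logic element-wise; the column's dtype stays object, but each cell then holds a numpy array of floats:

import numpy as np

# same idea as parse_list, applied to every cell of the column
df['List'] = df['List'].apply(lambda s: np.array(s.split(';'), dtype=float))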

Adding data to a Pandas dataframe

I have a dataframe that contains Physician_Profile_City, Physician_Profile_State and Physician_Profile_Zip_Code. I ultimately want to stratify an analysis based on state, but unfortunately not all of the Physician_Profile_States are filled in. I started looking around to try and figure out how to fill in the missing states. I came across the pyzipcode module, which takes a zip code as input and returns the state, as follows:
In [39]: from pyzipcode import ZipCodeDatabase
zcdb = ZipCodeDatabase()
zipcode = zcdb[54115]
zipcode.state
Out[39]: u'WI'
What I'm struggling with is how I would iterate through the dataframe and add the appropriate "Physician_Profile_State" when that variable is missing. Any suggestions would be most appreciated.
No need to iterate. If the data is in the form of a dict, then you should be able to perform the following:
df['Physician_Profile_State'] = df['Physician_Profile_Zip_Code'].map(zcdb)
Otherwise you can call apply like so:
df['Physician_Profile_State'] = df['Physician_Profile_Zip_Code'].apply(lambda x: zcdb[x].state)
In the case where the above won't work because it can't generate a Series to align with your df, you can apply row-wise, passing axis=1 to the df:
df['Physician_Profile_State'] = df[['Physician_Profile_Zip_Code']].apply(lambda row: zcdb[row['Physician_Profile_Zip_Code']].state, axis=1)
By using double square brackets we select a DataFrame rather than a Series, which allows you to pass the axis param.
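Since the original goal is only to fill in the missing states, a hedged variant restricts the lookup to rows where Physician_Profile_State is null (assuming, as in the pyzipcode example above, that zip codes can be looked up as integers):

# only touch rows where the state is actually missing
missing = df['Physician_Profile_State'].isnull()
df.loc[missing, 'Physician_Profile_State'] = (
    df.loc[missing, 'Physician_Profile_Zip_Code'].apply(lambda z: zcdb[int(z)].state)
)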

Unpacking data in python with the struct library

When I pack data to a fixed length and then unpack it, I am unable to retrieve the data without specifying the actual length of the data in advance.
How do I retrieve only the data, without the \x00 characters, without calculating its length beforehand?
>>> import struct
>>> with open("forums_file.dat", "w") as file:
...     file.truncate(1024)
>>> country = 'india'
>>> data = struct.pack('20s', country)
>>> print data
india
>>> data
'india\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
>>> print len(data)
20
>>> unpack_data = struct.unpack('5s', country)
>>> unpack_data
('india',)
In the above code snippet I had to specify the length of the data (5s) while unpacking.
Short answer: You can't do it directly.
Longer answer:
The more indirect solution is actually not that bad. When unpacking the string, you use the same length as you used for packing. That returns the string including the NUL chars (0 bytes).
Then you split on the NUL char and take the first item, like so:
result_with_NUL, = struct.unpack('20s', data)
print(repr(result_with_NUL))
result_string = result_with_NUL.split('\x00', 1)[0]
print(repr(result_string))
The , 1 parameter in split() is not strictly necessary, but makes it more efficient, as it splits only on the first occurrence of NUL instead of every single one.
Also note that when packing and unpacking with the goal to read/write files or exchange data with different systems, it's important to explicitly precede your format strings with "<" or ">" (or "=" in certain very special cases), both for packing and unpacking, since otherwise it will align and pad the structures, which is heavily system dependent and might cause hard to find bugs later.
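As a small hedged illustration (not part of the original answer), an explicit '>' prefix keeps the layout identical on both the packing and unpacking side, combined with the NUL-stripping shown above:

import struct

# '>' forces big-endian with standard sizes and no padding; use '<' for little-endian
record = struct.pack('>I20s', 7, 'india')
number, padded_name = struct.unpack('>I20s', record)
name = padded_name.split('\x00', 1)[0]  # strip the NUL padding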