How can I create an array from a messy text file - python-2.7

I have a text file in the form below...
Some line of text
Some line of text
Some line of text
--
data entry 0 (i = 0, j = 0); value = 1.000000
data entry 1 (i = 0, j = 1); value = 1.000000
data entry 2 (i = 0, j = 2); value = 1.000000
data entry 3 (i = 0, j = 3); value = 1.000000
and so on for quite a large number of lines. The total array ends up being 433 rows x 400 columns. There is a line of hyphens (--) separating each new i value. So far I have the following code:
f = open('text_file_name', 'r')
lines = f.readlines()
which simply opens the file and converts it to a list with each line as a separate string. I need to be able to create an array with the given values at the i and j positions; let's call the array A. The value of A[0,0] should be 1.000000. I don't know how to get from a messy text file (or, at the stage I'm at, a messy list) to a usable array.
EDIT:
The expected output is a NumPy array. If I can get to that point, I can work through the rest of the tasks in the problem
UPDATE:
Thank you, Lukasz, for the suggestion below. I sort of understand the code you wrote, but I don't understand it well enough to use it. However, you have given me some good ideas on what to do. The data entries begin on line 12 of the text file. Values for i are within the 22nd and 27th character places, values for j are within the 33rd and 39th character places, and values for value are within the 49th and 62nd character places. I realize this is overly specific for this particular text file, but my professor is fine with that.
Now, I've written the following code using the formatting of this text file
for x in range(12, len(lines)):
    if not lines[x].startswith(' data entry'):
        continue
    else:
        i = int(lines[x][22:28])
        j = int(lines[x][33:39])
        r = int(lines[x][49:62])
        matrix[i,j] = r
print matrix
and the following ValueError message is given:
r = int(lines[x][49:62])
ValueError: invalid literal for int() with base 10: '1.000000'
Can anyone explain why this is given (I should be able to convert the string '1.000000' to integer 1) and what I can do to correct the issue?

You may simply skip all lines that do not look like data lines. A simple regular expression retrieves the indices and the value:
import numpy as np
import re

def parse(line):
    m = re.search(r'\(i = (\d+), j = (\d+)\); value = (\S+)', line)
    if not m:
        raise ValueError("Invalid line", line)
    return int(m.group(1)), int(m.group(2)), float(m.group(3))

R = 433
C = 400
data_file = 'file.txt'

matrix = np.zeros((R, C))
with open(data_file) as f:
    for line in f:
        if not line.startswith('data entry'):
            continue
        i, j, v = parse(line)
        matrix[i, j] = v
print matrix
The main trouble here is the hardcoded matrix size. Ideally you'd somehow detect the size of the destination matrix before reading the data, or use another data structure and rebuild the NumPy array from it.
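A minimal sketch of that second idea (the helper name and the two sample lines are mine, not from the post): parse all the (i, j, value) triples into a list first, then size the matrix from the largest indices seen.

```python
import re
import numpy as np

# Sketch: collect (i, j, value) triples first, then allocate the
# matrix from the maximum indices actually present in the file.
pattern = re.compile(r'\(i = (\d+), j = (\d+)\); value = (\S+)')

def build_matrix(lines):
    triples = []
    for line in lines:
        m = pattern.search(line)
        if m:
            triples.append((int(m.group(1)), int(m.group(2)), float(m.group(3))))
    rows = 1 + max(i for i, j, v in triples)
    cols = 1 + max(j for i, j, v in triples)
    matrix = np.zeros((rows, cols))
    for i, j, v in triples:
        matrix[i, j] = v
    return matrix

sample = ["data entry 0 (i = 0, j = 0); value = 1.000000",
          "data entry 1 (i = 0, j = 1); value = 2.500000"]
print(build_matrix(sample).shape)  # (1, 2)
```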


Binary files: write with C++, read with MATLAB

I could use your support on this. Here is my issue:
I've got a 2D buffer of floats (in a data object) in a C++ code, that I write in a binary file using:
ptrToFile.write(reinterpret_cast<char *>(&data->array[0][0]), nbOfEltsInArray * sizeof(float));
The data contains 8192 floats, and I (correctly?) get a 32 kB (8192 * 4 bytes) file out of this line of code.
Now I want to read that binary file using MATLAB. The code is:
hdr_binaryfile = fopen(str_binaryfile_path,'r');
res2_raw = fread(hdr_binaryfile, 'float');
res2 = reshape(res2_raw, int_sizel, int_sizec);
But it's not happening as I expect it to happen. If I print the array of data in the C++ code using std::cout, I get:
pCarte_bin->m_size = 8192
pCarte_bin->m_sizel = 64
pCarte_bin->m_sizec = 128
pCarte_bin->m_p[0][0] = 1014.97
pCarte_bin->m_p[0][1] = 566946
pCarte_bin->m_p[0][2] = 423177
pCarte_bin->m_p[0][3] = 497375
pCarte_bin->m_p[0][4] = 624860
pCarte_bin->m_p[0][5] = 478834
pCarte_bin->m_p[1][0] = 2652.25
pCarte_bin->m_p[2][0] = 642077
pCarte_bin->m_p[3][0] = 5.33649e+006
pCarte_bin->m_p[4][0] = 3.80922e+006
pCarte_bin->m_p[5][0] = 568725
And on the MATLAB side, after I read the file using the little block of code above:
size(res2) = 64 128
res2(1,1) = 1014.9659
res2(1,2) = 323288.4063
res2(1,3) = 2652.2515
res2(1,4) = 457593.375
res2(1,5) = 642076.6875
res2(1,6) = 581674.625
res2(2,1) = 566946.1875
res2(3,1) = 423177.1563
res2(4,1) = 497374.6563
res2(5,1) = 624860.0625
res2(6,1) = 478833.7188
The size (lines, columns) is OK, as well as the very first item ([0][0] in C++ == [1][1] in MATLAB). But:
I'm reading the C++ line elements along the columns: [0][1] in C++ == [1][2] in MATLAB (remember that indexing starts at 1 in MATLAB), etc.
I'm reading one correct element out of two along the other dimension: [1][0] in C++ == [1][3] in MATLAB, [2][0] == [1][5], etc.
Any idea about this? Thanks!
Leaving aside the fact that there seems to be some precision difference (likely the display settings in MATLAB), the issue here is most likely the difference between row-major and column-major ordering of data. Without more details it is hard to be certain. In particular, MATLAB is column major, meaning that contiguous memory on disk is interpreted as sequential elements of a column rather than of a row.
The likely solution is to reverse the two sizes in your reshape and access the elements with the indices reversed. That is, swap int_sizel and int_sizec in the reshape call, and then read elements expecting
pCarte_bin->m_p[0][0] = res2(1,1)
pCarte_bin->m_p[0][1] = res2(2,1)
pCarte_bin->m_p[0][2] = res2(3,1)
pCarte_bin->m_p[0][3] = res2(4,1)
pCarte_bin->m_p[1][0] = res2(1,2)
etc.
You could also transpose the array in MATLAB after the read, but for a large array that could be costly in itself.
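For what it's worth, the effect is easy to reproduce in NumPy (a Python sketch rather than MATLAB; order='F' mimics MATLAB's column-major reshape):

```python
import numpy as np

# A 2x3 array laid out the way the C++ code writes it: row-major.
original = np.arange(6, dtype=np.float32).reshape(2, 3)
flat = original.ravel()  # bytes on disk: 0 1 2 3 4 5

# Reshaping column-major with the original sizes scrambles the layout...
wrong = flat.reshape(2, 3, order='F')
# ...but swapping the two sizes and transposing recovers the C++ array.
right = flat.reshape(3, 2, order='F').T

print(np.array_equal(right, original))  # True
print(np.array_equal(wrong, original))  # False
```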

Write a string and a formula for each group of data in a particular range

I have data in Excel and want to write a string with a sum for each group of the table.
So I want to loop through a range and write the string "Subtotal" in the first column, applying the formula '=sum({}:{})' from the group's start row to the row before the one where the formula is written.
I know the start and end of the range.
How can I achieve this with a loop that writes the string and formula at the first blank row found?
input:
See the code below; this is what I'm trying, but it does not work.
row_start = number_rows_placement + number_rows_adsize + 20
row_end = number_rows_placement + number_rows_adsize + number_rows_daily + unqiue_final_day_wise * 5 + 15
for i in range(row_start, row_end):
    if i == " ":
        worksheet.write(i, 1, "Subtotal", format)
        i += 5
        worksheet.write_formula(i, 2, '=sum(:)', format)
but it doesn't seem to be working, and I don't know where I'm going wrong. Also, while trying to get the sum, the range would vary from after each header to the row before the formula.
Output:
The formula isn't valid in Excel. It should be =SUM(), uppercase.
Also, you can generate the range for the formula with something like this:
from xlsxwriter.utility import xl_range
row_start = 60
row_end = 64
col = 1
cell_range = xl_range(row_start, col, row_end, col) # B61:B65
See the XlsxWriter Cell Utility Functions.
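Putting the two together, here is a self-contained sketch of building such a subtotal formula string. The helper below reimplements the (row, col) to 'B61' conversion that xl_range performs, just so the mapping is visible; the function names are mine, not part of XlsxWriter.

```python
def cell_name(row, col):
    """Zero-based (row, col) -> A1-style reference, e.g. (60, 1) -> 'B61'."""
    name = ''
    col += 1
    while col:
        col, rem = divmod(col - 1, 26)
        name = chr(ord('A') + rem) + name
    return '%s%d' % (name, row + 1)

def subtotal_formula(row_start, row_end, col):
    # Uppercase SUM, as recommended above.
    return '=SUM(%s:%s)' % (cell_name(row_start, col), cell_name(row_end, col))

print(subtotal_formula(60, 64, 1))  # =SUM(B61:B65)
```

With a worksheet object you would then write the result with worksheet.write_formula(row, col, subtotal_formula(...), format).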

Why won't H5PYDataset.get_data() work within function?

I have a function which is supposed to unpack an H5PY dataset, temp.hdf5, which contains only one example, so that it can be evaluated:
def getprob():
    test_set = H5PYDataset('temp.hdf5', which_sets=('test',))
    handle = test_set.open()
    test_data = test_set.get_data(handle, slice(0, 1))
    xx = test_data[0]
    YY = test_data[1]
    l, prob, rho_orig, rho_larger, rho_largest = f(xx)
    return prob[9][0]
where test_data[0] is a 28x28 array of integers and test_data[1] is an integer between 0 and 9.
The problem is that, from within the function, test_data[0] is always a 28x28 array of zeros, even though that is not what 'temp.hdf5' contains. test_data[1] always loads properly, though.
When these lines of code are run outside of the function, everything works just fine.
What is going on here?

Find the max value per group in a table file (Python)

I'm a beginner to Python programming and my problem is this:
I have to make a program which first reads a text file like this one:
A a 1 2 (line one)
A b 3 5 (line two)
A c 9 1
B d 2 4
B e 9 2
C r 3 4
...
and finds, for each first value (A, B, C, ...), which second value (a, b, c, ...) has the maximum (third value)*(fourth value) (1*2, 3*5, ...).
That is, in this example the result should be b, e, r.
And I need to do it either 1) without using the dictionary class and without saving each line's data,
or 2) by devising a class and object and doing the same thing.
(Actually I have to make this program twice, once with each method.)
What I'm really confused about is this: I made the program first by using a dictionary, but I have no idea how to do it with either of the two methods mentioned above.
I used a dictionary[dictionary[value]] structure (saving each line's data) and found which entry has the max value for each first value.
How can I do this in a different way?
In particular, is it even possible with method 1), without using the dictionary class and saving each line's data?
Thank you for reading my question.
I'm really just beginning to learn programming, and any advice would be really appreciated.
here is what I've done so far:
The code below works by storing the maximum values and comparing them with the values currently being read from the file. It is not complete: it deliberately does not handle instances where two of the products are equal, and it also does not handle an edge case that you should be able to find using your example input. I've left those for you to complete.
max_vals = []
with open('FILE.TXT', 'r') as f:
    max_first_val = None
    max_second_val = None
    max_prod = 0
    for line in f:
        vals = line.strip('\n').split(' ')
        curr_prod = int(vals[2]) * int(vals[3])
        if vals[0] != max_first_val and max_first_val is not None:
            max_vals.append(max_second_val)
            max_first_val = vals[0]
            max_prod = 0
        if curr_prod > max_prod:
            max_first_val = vals[0]
            max_second_val = vals[1]
            max_prod = curr_prod
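For approach 2) from the question, a small class can carry the same running-maximum idea. This is just a sketch (the class and variable names are mine), assuming, as in the sample input, that lines for the same first value are contiguous:

```python
class GroupMax(object):
    """Tracks the best (label, product) pair seen so far for one group."""
    def __init__(self, group):
        self.group = group
        self.best_label = None
        self.best_product = None

    def offer(self, label, third, fourth):
        product = third * fourth
        if self.best_product is None or product > self.best_product:
            self.best_label = label
            self.best_product = product

lines = ["A a 1 2", "A b 3 5", "A c 9 1", "B d 2 4", "B e 9 2", "C r 3 4"]
groups = []
for line in lines:
    g, label, third, fourth = line.split()
    # Start a new tracker whenever the first value changes.
    if not groups or groups[-1].group != g:
        groups.append(GroupMax(g))
    groups[-1].offer(label, int(third), int(fourth))

print([gm.best_label for gm in groups])  # ['b', 'e', 'r']
```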

Deep Learning IndexError: too many indices for array

I am trying to train the system on some data; Sound_Fc is a 16x1 float array.
for i in range(0, 26983):
    Block_coo = X[0, i]
    Fc = Block_coo[4]
    Sound_Fc = Fc[:, 0]
    Vib_Fc = Fc[:, 1]
    y = np.matrix([[1.0], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16]])

(trainX, testX, trainY, testY) = train_test_split(
    Sound_Fc, y, test_size=0.33, random_state=42)
dbn = NeuralNet(
    layers=[
        ('input', layers.InputLayer),
        ('hidden', layers.DenseLayer),
        ('output', layers.DenseLayer),
    ],
    input_shape=(None, trainX.shape[0]),
    hidden_num_units=8,
    output_num_units=4,
    output_nonlinearity=softmax,
    update=nesterov_momentum,
    update_learning_rate=0.3,
    update_momentum=0.9,
    regression=False,
    max_epochs=5,
    verbose=1,
)
dbn.fit(trainX, trainY)
But I'm getting this error:
Warning (from warnings module):
  File "C:\Users\Essam Seddik\AppData\Roaming\Python\Python27\site-packages\sklearn\cross_validation.py", line 399
    % (min_labels, self.n_folds)), Warning)
Warning: The least populated class in y has only 1 members, which is too few. The minimum number of labels for any class cannot be less than n_folds=5.
Traceback (most recent call last):
  File "C:\Essam Seddik\Deep Learning Python Tutorial\DNV_DeepLearn.py", line 77, in <module>
    dbn.fit(trainX,trainY)
  File "C:\Python27\lib\site-packages\nolearn-0.6adev-py2.7.egg\nolearn\lasagne\base.py", line 293, in fit
    self.train_loop(X, y)
  File "C:\Python27\lib\site-packages\nolearn-0.6adev-py2.7.egg\nolearn\lasagne\base.py", line 300, in train_loop
    X, y, self.eval_size)
  File "C:\Python27\lib\site-packages\nolearn-0.6adev-py2.7.egg\nolearn\lasagne\base.py", line 401, in train_test_split
    kf = StratifiedKFold(y, round(1. / eval_size))
  File "C:\Users\Essam Seddik\AppData\Roaming\Python\Python27\site-packages\sklearn\cross_validation.py", line 416, in __init__
    label_test_folds = test_folds[y == label]
IndexError: too many indices for array
I tried xrange instead of range, and y = list() instead of the defined y. I also tried small numbers in the for loop range, like 5, 10, and 100, instead of 26983.
I tried np.array, np.ndarray, and np.atleast_2d. Nothing works!
At every iteration of that loop you are overwriting Sound_Fc, so at the end of the loop the value of Sound_Fc is X[0,26982][4][:,0]. You are also overwriting y with the same value at each iteration; it is basically a vector of the values 1 to 16. So your total data is 16 points, and the y value of each is a unique value between 1 and 16. When you split this into training and test data, 5 of these 16 points become your test set and 11 your training set. With only a single example for each observed y output, your network model is complaining that it cannot extract enough information to predict these y values in the future.
If I understand correctly, instead of overwriting Sound_Fc and y at each iteration, you want to append them to growing x and y vectors. You can do this with np.vstack, which stacks NumPy arrays vertically. Replace that loop with the following:
Sound_Fc = np.vstack( [X[0,i][4][:,0] for i in range(26983)] )
y = np.vstack([np.matrix(range(1,17)).T for i in range(26983)])
Before, Sound_Fc had the shape (16, 1) when you used it as your feature vector. Now it will have the shape (431728, 1); that number is 26983*16, since you're stacking 26983 vectors of 16 elements each. Your y will also have the shape (431728, 1).
[X[0,i][4][:,0] for i in range(26983)] creates a list of 26983 elements, each element is a (16,1) shape numpy array. np.vstack stacks them vertically to get a single, tall, (431728, 1) array. This is your feature vector.
np.matrix(range(1,17)) creates a matrix with the elements 1 to 16; its shape is (1, 16). Taking its transpose with .T makes it vertical, with shape (16, 1). Again, we make a list of 26983 of these and vstack them to get a (431728, 1) vector that repeats the 1-16 pattern over and over. This is your output vector. Now, for each output (say 8, for instance), you have 26983 data points to learn from (well, 17809 once you split off .66 of all this as your training set). Now your model will not complain about not having enough examples for a specific y output.
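The shape arithmetic is easy to check on a small case (a sketch with 3 stacked vectors instead of 26983):

```python
import numpy as np

# Stacking N column vectors of shape (16, 1) yields one (N*16, 1) array,
# with the 1-16 pattern repeating down the column.
cols = [np.arange(1, 17).reshape(16, 1) for _ in range(3)]
stacked = np.vstack(cols)
print(stacked.shape)  # (48, 1)
```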
There might of course be other errors related to other things (I can't see your data, so I don't know what's in that big X).