Deep Learning IndexError: too many indices for array - python-2.7

I am trying to train the system on some data. Sound_Fc is a 16x1 float array.
for i in range(0,26983):
    Block_coo = X[0,i]
    Fc = Block_coo[4]
    Sound_Fc = Fc[:,0]
    Vib_Fc = Fc[:,1]
    y = np.matrix([[1.0],[2],[3],[4],[5],[6],[7],[8],[9],[10],[11],[12],[13],[14],[15],[16]])
(trainX, testX, trainY, testY) = train_test_split(
    Sound_Fc, y, test_size = 0.33, random_state=42)
dbn = NeuralNet(
    layers=[
        ('input', layers.InputLayer),
        ('hidden', layers.DenseLayer),
        ('output', layers.DenseLayer),
    ],
    input_shape = (None, trainX.shape[0]),
    hidden_num_units=8,
    output_num_units=4,
    output_nonlinearity=softmax,
    update=nesterov_momentum,
    update_learning_rate=0.3,
    update_momentum=0.9,
    regression=False,
    max_epochs=5,
    verbose=1,
)
dbn.fit(trainX,trainY)
But I'm getting this error
Warning (from warnings module):
File "C:\Users\Essam Seddik\AppData\Roaming\Python\Python27\site-packages\sklearn\cross_validation.py", line 399
% (min_labels, self.n_folds)), Warning)
Warning: The least populated class in y has only 1 members, which is too few. The minimum number of labels for any class cannot be less than n_folds=5.
Traceback (most recent call last):
File "C:\Essam Seddik\Deep Learning Python Tutorial\DNV_DeepLearn.py", line 77, in <module>
dbn.fit(trainX,trainY)
File "C:\Python27\lib\site-packages\nolearn-0.6adev-py2.7.egg\nolearn\lasagne\base.py", line 293, in fit
self.train_loop(X, y)
File "C:\Python27\lib\site-packages\nolearn-0.6adev-py2.7.egg\nolearn\lasagne\base.py", line 300, in train_loop
X, y, self.eval_size)
File "C:\Python27\lib\site-packages\nolearn-0.6adev-py2.7.egg\nolearn\lasagne\base.py", line 401, in train_test_split
kf = StratifiedKFold(y, round(1. / eval_size))
File "C:\Users\Essam Seddik\AppData\Roaming\Python\Python27\site-packages\sklearn\cross_validation.py", line 416, in __init__
label_test_folds = test_folds[y == label]
IndexError: too many indices for array
I tried xrange instead of range, and y=list() instead of the defined y. I also tried small numbers in the for loop range, like 5, 10 and 100 instead of 26983.
I also tried np.array, np.ndarray and np.atleast_2d. Nothing works!

At every iteration of that loop you are overwriting Sound_Fc, so at the end of the loop its value is X[0,26982][4][:,0]. You are also overwriting y with the same value at each iteration: a vector with the values 1 to 16. In effect, your total data is 16 points, and each has a unique y value (between 1 and 16). You then split this into training and test data, so about a third of these 16 points become your test set and the rest your training set. With only a single example for each observed y output, the model complains that it does not have enough examples per class to learn to predict these y values.
If I understand correctly, instead of overwriting Sound_Fc and y at each iteration, you want to append them to growing x and y vectors. You can do this with vstack, which vertically stacks numpy arrays. Replace that loop with the following:
Sound_Fc = np.vstack( [X[0,i][4][:,0] for i in range(26983)] )
y = np.vstack([np.matrix(range(1,17)).T for i in range(26983)])
Before, the Sound_Fc had the shape (16,1) when you used it as your feature vector. Now it will have the shape (431728, 1). That number is 26983*16, since you're stacking 26983 vectors that have 16 elements each. Your y will have the shape (431728, 1).
[X[0,i][4][:,0] for i in range(26983)] creates a list of 26983 elements, each element is a (16,1) shape numpy array. np.vstack stacks them vertically to get a single, tall, (431728, 1) array. This is your feature vector.
np.matrix(range(1,17)) creates a matrix with the elements 1 to 16. This is of shape (1, 16). By taking its transpose with .T, we make it vertical, so its shape becomes (16,1). Again, we make a list of 26983 of these and vstack them to get a (431728, 1) shape vector that repeats the 1-16 pattern over and over. This is your output vector. Now, for each output (let's say 8, for instance), you have 26983 data points to learn from (roughly 18,000 once about two thirds of the data goes to your training set). Your model will no longer complain about not having enough examples for a specific y output.
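The stacking described above can be checked on a small scale. The following is a minimal sketch that assumes each block contributes a (16, 1) column, as the answer states; the block contents are made-up placeholder values, and n_blocks = 3 stands in for 26983.

```python
import numpy as np

n_blocks = 3  # stand-in for 26983; an assumption to keep the example small

# Placeholder (16, 1) columns, one per block (hypothetical data).
blocks = [np.full((16, 1), float(i)) for i in range(n_blocks)]

# Stack the per-block columns into one tall feature column.
features = np.vstack(blocks)

# Repeat the labels 1..16 once per block, stacked the same way.
labels = np.vstack([np.arange(1, 17).reshape(-1, 1) for _ in range(n_blocks)])

print(features.shape, labels.shape)
```

Both arrays end up with n_blocks * 16 rows, so every row of features lines up with one label.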
There might of course be other errors related to other stuff (I can't see your data -- I don't know what's in that big X).
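One candidate for such a further error is the IndexError itself. The traceback ends at test_folds[y == label], and one plausible reading (an assumption, since the data is not visible) is that y is two-dimensional there, because np.matrix is always 2-D. The sketch below reproduces that situation with plain NumPy arrays; test_folds and the label values are stand-ins, not the library's actual internals.

```python
import numpy as np

test_folds = np.zeros(16, dtype=int)      # stand-in for a 1-D fold array
y_2d = np.arange(1, 17).reshape(-1, 1)    # column of labels, shape (16, 1)

# A 2-D boolean mask cannot index a 1-D array.
try:
    test_folds[y_2d == 1]
    raised = False
except IndexError:
    raised = True

# Flattening the labels to 1-D makes the same indexing valid.
y_1d = y_2d.ravel()
selected = test_folds[y_1d == 1]

print(raised, selected.shape)
```

If this is indeed the cause, passing a flat 1-D label array to fit may avoid the exception; whether other dtype requirements apply is not covered here.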

Related

How can I read from a .txt file to determine the size of a 2-D array?

I am working on a flood-fill recursion assignment where I have to read an ASCII art text file and fill it in. The assignment can be found here: https://faculty.utrgv.edu/robert.schweller/CS2380/homework/hw10.pdf
Recursion() //constructor
{
    column = -1;
    row = -1;
    grid = new char*[size of art row];
    for(int i = 0; i < size of art; i++)
    {
        board[row] = new char[size of art column]
    }
}
I'm not sure if determining the size of the array should be in the constructor or not. I need to know the size of the array in order to know where the user wants to fill the art file in. Also, here is all of the code for a better context. https://pastebin.com/TSYH26Ci
I would handle your file as binary. I do not know which OS or API you are using, so I will answer in a generic way.
1. Get the file size siz.
   Usually, seeking to 0 bytes from the end of the file gives you the file size.
2. Allocate a 1D array dat to store the entire file:
   dat = new BYTE[siz];
3. Load the file into memory (the 1D array).
   Do not forget to use binary access, as some ASCII art uses control codes (ASCII below 32) which could be corrupted by text-mode file access.
4. Scan for the end of the first line.
   Scan the array from 0 and stop when you find ASCII code 13 or 10. Its position gives you the x resolution of the ASCII art:
   int xs;
   for (xs=0;xs<siz;xs++)
    if ((dat[xs]==10)||(dat[xs]==13))
     break;
   Now xs should hold the x resolution.
5. Compute the y resolution ys.
   The safest way is to count the number of line endings (13 or 10). In that case you can even store the line start addresses in a pointer array BYTE **pixel=new BYTE*[ys]; which gives you simple 2D access pixel[y][x]. If your ASCII art is aligned and every line has the same length, then you can compute ys from the size:
   ys = siz/(xs+eol_size)
   where eol_size is 1 or 2 depending on the line ending used ((10), (13), (13,10) or (10,13)), so:
   eol_size=1;
   if (xs+1<siz)
    if ((dat[xs+1]==10)||(dat[xs+1]==13))
     eol_size=2;
As we do not have access to any input file we can only guess... If you need to generate one see:
C++ Image to ASCII art conversion
Here is an example of binary file access in VCL (bullets #1,#2,#3):
Convert the Linux open, read, write, close functions to work on Windows
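The measuring steps above can also be sketched in Python, which makes the idea easy to try before porting it to C++. This is only an illustration under assumptions: grid_size is a hypothetical helper name, the art is assumed to use \n, \r or \r\n line endings (bytes.splitlines handles those three, but not the rarer \n\r), and the x resolution is taken from the first line as in the answer.

```python
def grid_size(data):
    """Return (xs, ys) for ASCII art held as raw bytes."""
    # splitlines() recognises \n, \r and \r\n as line boundaries.
    lines = data.splitlines()
    ys = len(lines)                      # y resolution: number of lines
    xs = len(lines[0]) if lines else 0   # x resolution: width of the first line
    return xs, ys


# The bytes would normally come from open(path, "rb").read().
sample = b"ab#\r\n#..\r\n"
print(grid_size(sample))
```

Knowing both numbers before allocating the 2-D grid is what lets the allocation happen in one place, for example in the constructor.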

How can I create an array from a messy text file

I have a text file in the form below...
Some line of text
Some line of text
Some line of text
--
data entry 0 (i = 0, j = 0); value = 1.000000
data entry 1 (i = 0, j = 1); value = 1.000000
data entry 2 (i = 0, j = 2); value = 1.000000
data entry 3 (i = 0, j = 3); value = 1.000000
etc for quite a large number of lines. The total array ends up being 433 rows x 400 columns. There is a line of hyphens -- separating each new i value. So far I have the following code:
f = open('text_file_name', 'r')
lines = f.readlines()
which simply opens the file and converts it to a list with each line as a separate string. I need to be able to create an array with the given values at the i and j positions; let's call the array A. The value of A[0,0] should be 1.000000. I don't know how I can get from a messy text file (at the stage I am at, a messy list) to a usable array.
EDIT:
The expected output is a NumPy array. If I can get to that point, I can work through the rest of the tasks in the problem
UPDATE:
Thank you, Lukasz, for the suggestion below. I sort of understand the code you wrote, but I don't understand it well enough to use it. However, you have given me some good ideas on what to do. The data entries begin on line 12 of the text file. Values for i are within the 22nd and 27th character places, values for j are within the 33rd and 39th character places, and values for value are within the 49th and 62nd character places. I realize this is overly specific for this particular text file, but my professor is fine with that.
Now, I've written the following code using the formatting of this text file
for x in range(12,len(lines)):
    if not lines[x].startswith(' data entry'):
        continue
    else:
        i = int(lines[x][22:28])
        j = int(lines[x][33:39])
        r = int(lines[x][49:62])
        matrix[i,j] = r
print matrix
and the following ValueError message is given:
r = int(lines[x][49:62])
ValueError: invalid literal for int() with base 10: '1.000000'
Can anyone explain why this is given (I should be able to convert the string '1.000000' to integer 1) and what I can do to correct the issue?
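On the ValueError: int() only accepts strings that look like whole numbers, so a string containing a decimal point is rejected even when the fractional part is zero. A small sketch of the behaviour, using the literal from the error message:

```python
text = "1.000000"

# int() refuses a string with a decimal point.
try:
    int(text)
    failed = False
except ValueError:
    failed = True

# float() parses it; int() can then truncate the float if an integer is needed.
value = float(text)
as_int = int(value)

print(failed, value, as_int)
```

If the array is meant to hold the values as they appear in the file, storing the float directly is probably the simpler choice.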
You can simply skip every line that does not look like a data line.
A simple regular expression retrieves the indices and the value.
import numpy as np
import re
def parse(line):
    # raw string so the backslashes reach the regex engine unchanged
    m = re.search(r'\(i = (\d+), j = (\d+)\); value = (\S+)', line)
    if not m:
        raise ValueError("Invalid line", line)
    return int(m.group(1)), int(m.group(2)), float(m.group(3))
R = 433
C = 400
data_file = 'file.txt'
matrix = np.zeros((R, C))
with open(data_file) as f:
    for line in f:
        # lstrip() tolerates leading spaces before 'data entry'
        if not line.lstrip().startswith('data entry'):
            continue
        i, j, v = parse(line)
        matrix[i, j] = v
print(matrix)
The main trouble here is the hardcoded matrix size. Ideally you'd somehow detect the size of the destination matrix before reading the data, or use another data structure and rebuild the numpy array from that structure.
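One way to avoid the hardcoded size is to collect the entries first and derive the dimensions from the largest indices seen. The sketch below is an illustration under assumptions: parse_entries and to_rows are hypothetical helper names, the line format is the one shown in the question, and plain nested lists are used so the example has no dependencies (np.array(rows) would turn the result into a NumPy array).

```python
import re

PATTERN = re.compile(r"\(i = (\d+), j = (\d+)\); value = (\S+)")


def parse_entries(lines):
    """Map (i, j) -> value for every line that matches the data pattern."""
    entries = {}
    for line in lines:
        m = PATTERN.search(line)
        if m:  # header lines and '--' separators simply do not match
            entries[(int(m.group(1)), int(m.group(2)))] = float(m.group(3))
    return entries


def to_rows(entries):
    """Build a dense row-major table sized from the largest indices."""
    n_rows = 1 + max(i for i, _ in entries)
    n_cols = 1 + max(j for _, j in entries)
    rows = [[0.0] * n_cols for _ in range(n_rows)]
    for (i, j), value in entries.items():
        rows[i][j] = value
    return rows


sample = [
    "Some line of text",
    "--",
    " data entry 0 (i = 0, j = 0); value = 1.000000",
    " data entry 1 (i = 0, j = 1); value = 2.500000",
    "--",
    " data entry 2 (i = 1, j = 0); value = 3.000000",
]
rows = to_rows(parse_entries(sample))
print(rows)
```

Positions that never appear in the file stay at 0.0, and to_rows expects at least one entry.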

Displaying a table with two columns in Fortran with available data

I have two variables say x and y and both have around 60 points in them(basically values of the x and y axis of the plot). Now when I try to display it in the result file in form of a column or a table with the x value and the corresponding y value I end up with all the x values displayed in both the columns followed then by the y values. I am unable to get it out correctly.
This is a small part of the code
xpts = PIC1(1,6:NYPIX,1)
ypts = PIC1(2,6:NYPIX,1)
write(21,*), NYPIX
write(21,"(T2,F10.4: T60,F10.4)"), xpts, ypts
This is the output I get. the x values continue from the column 1 to 2 till all are displayed and then the y values are displayed.
128.7018 128.7042
128.7066 128.7089
128.7113 128.7137
128.7160 128.7184
128.7207 128.7231
128.7255 128.7278
128.7302 128.7325
128.7349 128.7373
128.7396 128.7420
128.7444 128.7467
128.7491 128.7514
128.7538 128.7562
128.7585 128.7609
128.7633 128.7656
128.7680 128.7703
128.7727 128.7751
128.7774 128.7798
128.7822 128.7845
128.7869 128.7892
128.7916 128.7940
128.7963 128.7987
128.8011 128.8034
86.7117 86.7036
86.6760 86.6946
86.6317 86.6467
86.6784 86.8192
86.8634 87.0909
87.2584 87.6427
88.1245 88.8343
89.5275 90.2652
91.0958 91.8668
92.6358 93.2986
93.8727 94.4631
You could use a do loop:
do i=1,size(xpts)
  write(21,"(T2,F10.4: T60,F10.4)") xpts(i), ypts(i)
enddo
There is already an answer saying how to get the output as wanted. It may be good, though, to explicitly say why the (unwanted) output as in the question comes about.
In the (generalized) statement
write(unit,fmt) xpts, ypts
the xpts, ypts is the output list. In the description of how the output list is treated we see (Fortran 2008 9.6.3)
If an array appears as an input/output list item, it is treated as if the elements, if any, were specified in array element order
That is, it shouldn't be too surprising that (assuming the lower bound of xpts and ypts are 1)
write(unit, fmt) xpts(1), xpts(2), xpts(3), ..., ypts(1), ypts(2), ...
gives the output seen.
Using a do loop expanded as
write(unit, fmt) xpts(1), ypts(1)
write(unit, fmt) xpts(2), ypts(2)
...
is indeed precisely what is wanted here. However, a more general "give me the elements of the arrays interleaved" could be done with an output implied-do:
write(unit, fmt) (xpts(i), ypts(i), i=LBOUND(xpts,1),UBOUND(xpts,1))
(assuming that the upper and lower bounds of ypts are the same as xpts).
This is equivalent to
write(unit, fmt) xpts(1), ypts(1), xpts(2), ypts(2), ...
(again, for convenience switching to the assumption about lower bounds).
This implied-do may be more natural in some cases. In particular, note that the explicit do loop writes one record for each pair of elements from xpts and ypts, whereas with the implied-do each new record comes about from format reversion. For the format in the question the two are equivalent, but for some more exotic formats the explicit loop may not give what is wanted, and it ties the structure of the do loop to the format.
This splitting of records holds even more so for unformatted output (which hasn't format reversion).
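The difference between the two output lists can be modelled outside Fortran. The following is a rough Python model, not Fortran: the output list is treated as one flat sequence, and the hypothetical records helper cuts it into records of two items each, which is what format reversion does for a two-item format. The data values are made up.

```python
def records(values, per_record=2):
    """Cut a flat output list into records of per_record items each."""
    return [values[k:k + per_record] for k in range(0, len(values), per_record)]


xpts = [1.0, 2.0, 3.0, 4.0]
ypts = [10.0, 20.0, 30.0, 40.0]

# 'write(...) xpts, ypts': all of xpts is expanded first, then all of ypts.
flattened = xpts + ypts

# The implied-do expands to x(1), y(1), x(2), y(2), ...
interleaved = [v for pair in zip(xpts, ypts) for v in pair]

for row in records(flattened):
    print(row)
for row in records(interleaved):
    print(row)
```

In the first listing the x values fill both columns before any y value appears, matching the output in the question; in the second each record holds one x and its y.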

Getting all combinations of splitting an array into two equally sized groups in Julia

Given an array of 20 numbers, I would like to extract all possible combinations of two groups, with ten numbers in each, order is not important.
combinations([1, 2, 3], 2)
in Julia will give me all possible combinations of two numbers drawn from the array, but I also need the ones that were not drawn...
You can use setdiff to determine the items missing from any vector, e.g.,
y = setdiff(1:5, [2,4])
yields [1,3,5].
After playing around for a bit, I came up with this code, which seems to work. I'm sure it could be written much more elegantly, etc.
function removeall!(remove::Array, a::Array)
    for i in remove
        if in(i, a)
            splice!(a, indexin([i], a)[1])
        end
    end
end
function combinationgroups(a::Array, count::Integer)
    result = {}
    for i in combinations(a, count)
        all = copy(a)
        removeall!(i, all)
        push!(result, { i; all } )
    end
    result
end
combinationgroups([1,2,3,4],2)
6-element Array{Any,1}:
{[1,2],[3,4]}
{[1,3],[2,4]}
{[1,4],[2,3]}
{[2,3],[1,4]}
{[2,4],[1,3]}
{[3,4],[1,2]}
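In the listing above every split shows up twice, once in each order (for example [1,2] with [3,4], and [3,4] with [1,2]). Since the question says order is not important, one way to get each equal-sized split exactly once is to always keep the first element in the first group and choose only its companions. This is a sketch of that idea in Python rather than Julia; equal_splits is a hypothetical name, and working on positions instead of values keeps it correct when the numbers are not unique.

```python
from itertools import combinations


def equal_splits(items):
    """Return each way to split items into two equal halves exactly once."""
    items = list(items)
    if len(items) % 2 != 0:
        raise ValueError("need an even number of items")
    if not items:
        return []
    half = len(items) // 2
    splits = []
    # Position 0 always stays in the first group, so mirrored splits never occur.
    for chosen in combinations(range(1, len(items)), half - 1):
        first_idx = (0,) + chosen
        first = [items[p] for p in first_idx]
        second = [items[p] for p in range(len(items)) if p not in first_idx]
        splits.append((first, second))
    return splits


print(equal_splits([1, 2, 3, 4]))
```

For 20 items this yields half of the 184756 ten-element combinations, that is 92378 splits.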
Based on @tholy's comment that, instead of using the actual numbers, I could use positions (to avoid problems with numbers not being unique) and setdiff to get the "other group" (the non-selected numbers), I came up with the following. The first function grabs values out of an array based on indices (i.e. arraybyindex([11,12,13,14,15], [2,4]) => [12,14]). This seems like it could be part of the standard library (I did look for it, but might have missed it).
The second function does what combinationgroups was doing above, creating all groups of a certain size and their complements. It can be called by itself, or through the third function, which extracts groups of all possible sizes. It's possible that this could all be written to run much faster, and more idiomatically.
function arraybyindex(a::Array, indx::Array)
    res = {}
    for e in indx
        push!(res, a[e])
    end
    res
end
function combinationsbypos(a::Array, n::Integer)
    res = {}
    positions = 1:length(a)
    for e in combinations(positions, n)
        push!(res, { arraybyindex(a, e) ; arraybyindex(a, setdiff(positions, e)) })
    end
    res
end
function allcombinationgroups(a::Array)
    # integer division, so the group sizes passed on are Integers
    maxsplit = div(length(a), 2)
    res = {}
    for e in 1:maxsplit
        println("Calculating for $e, so far $(length(res)) groups calculated")
        # append! adds the groups one by one, matching the counts printed below
        append!(res, combinationsbypos(a, e))
    end
    res
end
Running this in IJulia on a 3 year old MacBook pro gives
@time c=allcombinationgroups([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20])
println(length(c))
c
Calculating for 1, so far 0 groups calculated
Calculating for 2, so far 20 groups calculated
Calculating for 3, so far 210 groups calculated
Calculating for 4, so far 1350 groups calculated
Calculating for 5, so far 6195 groups calculated
Calculating for 6, so far 21699 groups calculated
Calculating for 7, so far 60459 groups calculated
Calculating for 8, so far 137979 groups calculated
Calculating for 9, so far 263949 groups calculated
Calculating for 10, so far 431909 groups calculated
elapsed time: 11.565218719 seconds (1894698956 bytes allocated)
Out[49]:
616665
616665-element Array{Any,1}:
{{1},{2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20}}
{{2},{1,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20}}
⋮
{{10,12,13,14,15,16,17,18,19,20},{1,2,3,4,5,6,7,8,9,11}}
{{11,12,13,14,15,16,17,18,19,20},{1,2,3,4,5,6,7,8,9,10}}
ie. 53,334 groups calculated per second.
As a contrast, using the same outer allcombinationgroups function, but replacing the call to combinationsbypos with a call to combinationgroups (see previous answer), is 10x slower.
I then rewrote the array-by-index grouping using true/false flags, as suggested by @tholy (I couldn't figure out how to get it to work using [], so I used setindex! explicitly), and moved it into one function. Another 10x speedup! 616,665 groups in 1 second!
Final code (so far):
function combinationsbypos(a::Array, n::Integer)
    res = {}
    positions = 1:length(a)
    emptyflags = falses(length(a))
    for e in combinations(positions, n)
        flag = copy(emptyflags)
        setindex!(flag, true, e)
        push!(res, {a[flag] ; a[!flag]} )
    end
    res
end
function allcombinationgroups(a::Array)
    # integer division, so the group sizes passed on are Integers
    maxsplit = div(length(a), 2)
    res = {}
    for e in 1:maxsplit
        res = vcat(res, combinationsbypos(a, e))
    end
    res
end
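The flag-based idea in the final code is not specific to Julia. As a rough illustration in Python (splits_by_flags is a hypothetical name, and nothing here is a claim about relative speed), a boolean flag per position selects one group, and its negation selects the complement:

```python
from itertools import combinations


def splits_by_flags(items, n):
    """Yield (picked, rest) for every choice of n positions, using boolean flags."""
    for chosen in combinations(range(len(items)), n):
        flags = [False] * len(items)
        for p in chosen:
            flags[p] = True  # mark the chosen positions
        picked = [x for x, f in zip(items, flags) if f]
        rest = [x for x, f in zip(items, flags) if not f]
        yield picked, rest


print(list(splits_by_flags([7, 7, 8], 1)))
```

Because the flags refer to positions, repeated values such as the two 7s above are handled without any special casing.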

Dynamically Delete Elements Within an R loop

Ok guys, as requested, I will add more info so that you understand why a simple vector operation is not possible. It's not easy to explain in few words but let's see. I have a huge amount of points over a 2D space.
I divide my space in a grid with a given resolution,say, 100m. The main loop that I am not sure if it's mandatory or not (any alternative is welcomed) is to go through EACH cell/pixel that contains at least 2 points (right now I am using the method quadratcount within the package spatstat).
Inside this loop, thus for each one of this non empty cells, I have to find and keep only a maximum of 10 Male-Female pairs that are within 3 meters from each other. The 3-meter buffer can be done using the "disc" function within spatstat. To select points falling inside a buffer you can use the method pnt.in.poly within the SDMTools package. All that because pixels have a maximum capacity that cannot be exceeded. Since in each cell there can be hundreds or thousands of points I am trying to find a smart way to use another loop/similar method to:
1) go through each point, one at a time
2) create a buffer and select the points of the opposite sex
3) save the closest Male-Female (0-1) pair in another dataframe (called new_colonies)
4) remove those points from the dataframe so that it shrinks and I don't have to consider them anymore
5) as soon as the new dataframe reaches 10 rows, stop everything and go to the next cell (thus skipping all remaining points)
Here is the code that I developed to be run within each cell (right now it takes too long):
head(df,20):
X Y Sex ID
2 583058.2 2882774 1 1
3 582915.6 2883378 0 2
4 582592.8 2883297 1 3
5 582793.0 2883410 1 4
6 582925.7 2883397 1 5
7 582934.2 2883277 0 6
8 582874.7 2883336 0 7
9 583135.9 2882773 1 8
10 582955.5 2883306 1 9
11 583090.2 2883331 0 10
12 582855.3 2883358 1 11
13 582908.9 2883035 1 12
14 582608.8 2883715 0 13
15 582946.7 2883488 1 14
16 582749.8 2883062 0 15
17 582906.4 2883317 0 16
18 582598.9 2883390 0 17
19 582890.2 2883413 0 18
20 582752.8 2883361 0 19
21 582953.1 2883230 1 20
Inside each cell I must run something according to what I explained above..
for(i in 1:dim(df)[1]){
  new_colonies <- data.frame(ID1=0,ID2=0,X=0,Y=0)
  discbuff <- disc(radius, centre=c(df$X[i], df$Y[i]))
  #define the points and polygon
  pnts = cbind(df$X[-i],df$Y[-i])
  polypnts = cbind(x = discbuff$bdry[[1]]$x, y = discbuff$bdry[[1]]$y)
  out = pnt.in.poly(pnts,polypnts)
  out$ID <- df$ID[-i]
  if (any(out$pip == 1)) {
    pnt.inBuffID <- out$ID[which(out$pip == 1)]
    cond <- df$Sex[i] != df$Sex[pnt.inBuffID]
    if (any(cond)){
      eucdist <- sqrt((df$X[i] - df$X[pnt.inBuffID][cond])^2 + (df$Y[i] - df$Y[pnt.inBuffID][cond])^2)
      IDvect <- pnt.inBuffID[cond]
      new_colonies_temp <- data.frame(ID1=df$ID[i], ID2=IDvect[which(eucdist==min(eucdist))],
        X=(df$X[i] + df$X[pnt.inBuffID][cond][which(eucdist==min(eucdist))]) / 2,
        Y=(df$Y[i] + df$Y[pnt.inBuffID][cond][which(eucdist==min(eucdist))]) / 2)
      new_colonies <- rbind(new_colonies,new_colonies_temp)
      if (dim(new_colonies)[1] == maxdensity) break
    }
  }
}
new_colonies <- new_colonies[-1,]
Any help appreciated!
Thanks
Francesco
In your case I wouldn't worry about deleting the points as you go, skipping is the critical thing. I also wouldn't make up a new data.frame piece by piece like you seem to be doing. Both of those things slow you down a lot. Having a selection vector is much more efficient (perhaps part of the data.frame, that you set to FALSE beforehand).
df$sel <- FALSE
Now, when you go through you set df$sel to TRUE for each item you want to keep. Just skip to the next cell when you find your 10. Deleting values as you go will be time consuming and memory intensive, as will slowly growing a new data.frame. When you're all done going through them then you can just select your data based on the selection column.
df <- df[ df$sel, ]
(or maybe make a copy of the data.frame at that point)
You also might want to use the dist function to calculate a matrix of distances.
from ?dist
"This function computes and returns the distance matrix computed by using the specified distance measure to compute the distances between the rows of a data matrix."
I'm assuming you are doing something sufficiently complicated that the for-loop is actually required...
So here's one rather simple approach: first just gather the rows to delete (or keep), and then delete the rows afterwards. Typically this will be much faster too since you don't modify the data.frame on each loop iteration.
df <- generateTheDataFrame()
keepRows <- rep(TRUE, nrow(df))
for(i in seq_len(nrow(df))) {
  rows <- findRowsToDelete(df, df[i,])
  keepRows[rows] <- FALSE
}
# Delete afterwards
df <- df[keepRows, ]
...and if you really need to work on the shrunk data in each iteration, just change the for-loop part to:
for(i in seq_len(nrow(df))) {
  if (keepRows[i]) {
    rows <- findRowsToDelete(df[keepRows, ], df[i,])
    keepRows[rows] <- FALSE
  }
}
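The selection-vector approach from these answers, combined with the early stop the question asks for, might look roughly like this. It is a sketch in Python rather than R, under stated assumptions: pair_colonies is a hypothetical name, points are (x, y, sex, id) tuples, the 3-metre buffer is replaced by a plain Euclidean distance check, and the sample coordinates are made up.

```python
import math


def pair_colonies(points, radius=3.0, max_pairs=10):
    """Pair each point with its nearest unused opposite-sex point within radius."""
    used = [False] * len(points)  # selection vector instead of deleting rows
    pairs = []
    for i, (xi, yi, sexi, idi) in enumerate(points):
        if used[i]:
            continue  # already part of a pair, skip it
        best_j, best_d = None, radius
        for j, (xj, yj, sexj, _) in enumerate(points):
            if j == i or used[j] or sexj == sexi:
                continue
            d = math.hypot(xi - xj, yi - yj)
            if d <= best_d:
                best_j, best_d = j, d
        if best_j is not None:
            used[i] = used[best_j] = True
            xj, yj, _, idj = points[best_j]
            # Store both IDs and the midpoint, as new_colonies does.
            pairs.append((idi, idj, (xi + xj) / 2, (yi + yj) / 2))
            if len(pairs) == max_pairs:
                break  # cell is full, move on to the next cell
    return pairs


cell = [(0.0, 0.0, 1, 1), (1.0, 0.0, 0, 2), (2.0, 0.0, 1, 3), (100.0, 100.0, 0, 4)]
print(pair_colonies(cell))
```

Nothing is removed from the input while looping; the used flags make paired points invisible to later iterations, which has the same effect as shrinking the data frame.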
I'm not exactly clear on why you're looping. If you could describe what kind of conditions you're checking there might be a nice vectorized way of doing it.
However as a very simple fix have you considered looping through the dataframe backwards?