Access indices from 2D list - Python - python-2.7

I'm trying to get a list of indices from a 2D array, and I'm getting the following error. Basically I want to find where my data is between two values, and set a 'weights' array to 1.0 at those locations for use in a later calculation.
#data = numpy array of size (141,141)
weights = np.zeros([141,141])
ind = [x for x,y in enumerate(data) if y>40. and y<50.]
weights[ind] = 1.0
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
I've tried using np.extract() but that doesn't give indices...

I think I got it to work by doing this:
#data = numpy array of size (141,141)
weights = np.zeros([141,141])
ind = ((data > 40.) & (data < 50.)).astype(float)
weights[np.where(ind==1)]=1.0
This is thanks to the helpful comment about using numpy's vectorizing capability. The third line outputs a (141, 141) array of 1's where the conditions are met and 0's where they are not, and I then filled my 'weights' array with 1.0 at those locations.
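For what it's worth, the intermediate float array isn't strictly needed; a boolean mask can index weights directly (a small sketch under the same setup):
weights = np.zeros([141, 141])
weights[(data > 40.) & (data < 50.)] = 1.0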

If you need to fill weights with ((value - 40) / 10), then using numpy.ma is better:
data = np.random.uniform(0, 100, size=(141, 141))
weights = ((np.ma.masked_outside(data, 40, 50) - 40) / 10).filled(0)
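An equivalent without numpy.ma, in case masked arrays are unfamiliar (a sketch assuming the same data array):
inside = (data >= 40) & (data <= 50)
weights = np.where(inside, (data - 40) / 10, 0.0)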

Related

Finding random index with specific value in large numpy array

I have a very large 2D numpy array (~5e8 values). I have labeled that array using scipy.ndimage.label I then want to find a random index of the flattened array that contains each label. I can do this with:
import numpy as np
from scipy.ndimage import label
base_array = np.random.randint(0, 5, (100000, 5000))
labeled_array, nlabels = label(base_array)
for label_num in xrange(1, nlabels+1):
    indices = np.where(labeled_array.flat == label_num)[0]
    index = np.random.choice(indices)
But, it is slow with an array this large. I have also tried replacing the np.where with:
indices = np.argwhere(labeled_array.flat == label_num).squeeze()
And found it to be slower. I have a suspicion that the boolean masking is the slow part. Is there any way to speed this up, or a better way to do this? I will say that in my real application the array is fairly sparse, with about 25% fill, though I have no experience with scipy's sparse array functions.
Your suspicion that masking separately for each label is expensive is correct, because no matter how you do it the masking will always be O(n).
We can circumvent this by argsorting by label and then randomly picking from each block of equal labels.
Since the labels are an integer range we can get the argsort cheaper than np.argsort by using some sparse matrix machinery available in scipy.
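To make the idea concrete, a plain np.argsort version of the same approach looks roughly like this (my sketch, using labeled_array from the question; slower and more memory-hungry than the sparse trick used below):
# argsort the flat labels so that equal labels form contiguous blocks
flat = labeled_array.ravel()
order = np.argsort(flat)
sorted_labels = flat[order]
# block boundaries: positions where the sorted label value changes
starts = np.r_[0, np.flatnonzero(np.diff(sorted_labels)) + 1, sorted_labels.size]
# one uniformly random flat index per distinct label (label 0 included)
result = order[[np.random.randint(lo, hi) for lo, hi in zip(starts[:-1], starts[1:])]]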
As my machine doesn't have an awful lot of RAM I had to shrink your example a bit (by a factor of 4). It then runs in about 5 seconds.
import numpy as np
from scipy.ndimage import label
from scipy import sparse
def multi_randint(bins):
    """Draw one random int from each range(bins[i], bins[i+1])."""
    high = np.diff(bins)
    n = high.size
    pick = np.random.randint(0, 1<<30, (n,))
    # reject picks that would bias pick % high towards small remainders
    reject = np.flatnonzero(pick + (1<<30) % high >= (1<<30))
    while reject.size:
        npick = np.random.randint(0, 1<<30, (reject.size,))
        rejrej = npick + (1<<30) % high[reject] >= (1<<30)
        pick[reject] = npick
        reject = reject[rejrej]
    return bins[:-1] + pick % high
# build mock data, note that I had to shrink by 4x b/c memory
base_array = np.random.randint(0, 5, (50000, 2500), dtype=np.int8)
labeled_array, nlabels = label(base_array)
# build auxiliary sparse matrix
h = sparse.csr_matrix(
    (np.ones(labeled_array.size, bool), labeled_array.ravel(),
     np.arange(labeled_array.size+1, dtype=np.int32)),
    (labeled_array.size, nlabels+1))
# conversion to csc argsorts the labels (but cheaper than argsort)
h = h.tocsc()
# draw
result = h.indices[multi_randint(h.indptr)]
# check result
assert len(set(labeled_array.ravel()[result])) == nlabels+1

Truth error while trying to find all equal minimum values in array, then retrieve indices

I'm trying to find all the minimum values in an array and retrieve their indices.
import numpy as np
a = np.array([[1,2],[1,4]])
minE = np.min(a)
ax,ay = np.unravel_index(minE, a.shape)
This only returns minE = 1 and ax, ay = 0, 1.
Can anyone help me out in a way that would also provide indices for all equal value minima (here, indices for both 1's)?
Were you looking for this:
x = np.array([[1,2,3],[1,4,2]])
np.where(x == np.amin(x))
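If you want the locations as (row, col) pairs rather than two parallel index arrays, np.argwhere gives the same information stacked (a small addition, using the same x):
x = np.array([[1, 2, 3], [1, 4, 2]])
np.argwhere(x == x.min())   # array([[0, 0], [1, 0]])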

Add extra feature to a matrix, np.concatenate error: only length-1 arrays can be converted to Python scalars

I want to add an extra column to my matrix in order to predict some features with some machine learning algorithms.
My trainSet has 8899 rows and 11 dimensions.
All I want to do is add the extra dimension, distance (see code).
But I got an error:
only length-1 arrays can be converted to Python scalars
temp_train_long/lat is (8899L,)
X_train = df_train.as_matrix()
temp_train_long=(X_train[:,3] - X_train[:,7])**2#long
temp_train_lat = (X_train[:,4] - X_train[:,8])**2#lat
distance = np.sqrt(temp_train_long + temp_train_lat)
np.concatenate(X_train, distance.T)
Review the concatenate docs
concatenate((a1, a2, ...), axis=0)
The function takes two arguments: the first is a list or tuple of the arrays that you want to join; the second is a number denoting the axis. It returns a new array; it does not operate in place.
X_train = df_train.as_matrix()
So this is 2d, (8899, n) with n larger than 9. According to the pandas documentation this is a numpy array, not a numpy matrix (that's important).
temp_train_long=(X_train[:,3] - X_train[:,7])**2#long
temp_train_lat = (X_train[:,4] - X_train[:,8])**2#lat
Two 1d arrays (8899,)
distance = np.sqrt(temp_train_long + temp_train_lat)
Also (8899,). distance.T does nothing; there is no change in shape.
np.concatenate(X_train, distance.T)
You give it two arguments: one is the 2d array, and the other, in the axis slot, is a 1d array.
You probably want
new_train = np.concatenate((X_train, distance[:,None]), axis=1)
Two arrays in one tuple, and the axis is a scalar. The distance array has been turned into a 2d one-column array.
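If you prefer, np.column_stack handles the reshaping for you, so an equivalent call (same assumptions about X_train and distance) is:
new_train = np.column_stack((X_train, distance))   # same result as the concatenate above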

Getting "axes don't match array" error when trying to visualize all layers in caffe using classification.ipynb

I am a newbie in Python and have a very basic knowledge of the language. Having said that, I'm trying to get the visualization of all layers, both the weights and their filters. To do this, instead of repeating:
# the parameters are a list of [weights, biases]
filters = net.params['conv1'][0].data
vis_square(filters.transpose(0, 2, 3, 1))
and changing the layer name, I tried using a loop like this:
for layer_name, param in net.params.iteritems():
    # the parameters are a list of [weights, biases]
    filters = net.params[layer_name][0].data
    vis_square(filters.transpose(0, 2, 3, 1))
now it works fine for the first layer, but gives this error and stops running:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-16-cf7d5999a45c> in <module>()
2 # the parameters are a list of [weights, biases]
3 filters = net.params[layer_name][0].data
----> 4 vis_square(filters.transpose(0, 2, 3, 1))
ValueError: axes don't match array
And this is the definition of vis_square() (defined in classification.ipynb in the examples directory of caffe):
def vis_square(data):
    """Take an array of shape (n, height, width) or (n, height, width, 3)
    and visualize each (height, width) thing in a grid of size approx. sqrt(n) by sqrt(n)"""
    # normalize data for display
    data = (data - data.min()) / (data.max() - data.min())
    # force the number of filters to be square
    n = int(np.ceil(np.sqrt(data.shape[0])))
    padding = (((0, n ** 2 - data.shape[0]),
                (0, 1), (0, 1))                # add some space between filters
               + ((0, 0),) * (data.ndim - 3))  # don't pad the last dimension (if there is one)
    data = np.pad(data, padding, mode='constant', constant_values=1)  # pad with ones (white)
    # tile the filters into an image
    data = data.reshape((n, n) + data.shape[1:]).transpose((0, 2, 1, 3) + tuple(range(4, data.ndim + 1)))
    data = data.reshape((n * data.shape[1], n * data.shape[3]) + data.shape[4:])
    plt.imshow(data); plt.axis('off')
What is wrong here?
I'd be grateful if anyone could give me a hand on this.
For subsequent layers, the number of input channels is greater than 3. For instance, if you have num_output: 64 in the first layer and num_output: 64 in the second as well, the shape of the 4D array that stores the second layer's weights is 64 x 64 x height x width. After you do the transpose, it's 64 x height x width x 64.
Your function is not capable of handling a 64-channel object, though it's great for 3-channel objects.
I would just do n = int(np.ceil(np.sqrt(data.shape[0] * data.shape[3]))) and reshape the whole thing into a single-channel object. I don't think visualising the convolution kernels as RGB will give you any insight.
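A rough sketch of that suggestion (my interpretation, not code from the answer): collapse the channel axis into the filter axis so each kernel is drawn as a single grayscale tile, keeping the RGB transpose only for 3-channel layers.
for layer_name, param in net.params.iteritems():
    filters = net.params[layer_name][0].data
    if filters.ndim != 4:
        continue                                    # skip non-convolutional layers
    if filters.shape[1] == 3:
        vis_square(filters.transpose(0, 2, 3, 1))   # 3 input channels: display as RGB
    else:
        # collapse (num_output, channels) into one axis; each kernel appears in grayscale
        vis_square(filters.reshape(-1, filters.shape[2], filters.shape[3]))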
For anyone having a similar problem ("axes don't match array" error): right before transposing, I put my data in a variable giving the exact size. If my data is Data with a size of 10*12*15:
DataI = Data [0:9, 0:11, 0:14]
DataII = np.transpose(DataI,(0,2,1))
this worked for me.

Vectorized version of array calculation

Is there a way of vectorizing the following array calculation (i.e. without using for loops):
for i in range(numCells):
    z[i] = ((i_mask == i)*s_image).sum()/pixel_counts[i]
s_image is an image stored as a 2-dimensional ndarray (I removed the colour dimension here for simplicity). i_mask is also a 2-dimensional array of the same size as s_image but it contains integers which are indexes to a list of 'cells' of length numCells. The result, z, is a 1-dimensional array of length numCells. The purpose of the calculation is to sum all the pixel values where the mask contains the same index and put the results in the z vector. (pixel_counts is also a 1-dimensional array of length numCells).
As one vectorized approach, you can take advantage of broadcasting and matrix-multiplication, like so -
# Generate a binary array of matches for all elements in i_mask against
# an array of indices going from 0 to numCells
matches = i_mask.ravel() == np.arange(numCells)[:,None]
# Do elementwise multiplication against s_image and sum those up for
# each such index going from 0 to numCells. This is essentially doing
# matrix multiplication. Finally, elementwise divide by pixel_counts
out = matches.dot(s_image.ravel())/pixel_counts
Alternatively, as another vectorized approach, you can do those multiplication and summation with np.einsum as well, which might give a boost to the performance, like so -
out = np.einsum('ij,j->i',matches,s_image.ravel())/pixel_counts
Runtime tests -
Function definitions:
def vectorized_app1(s_image,i_mask,pixel_counts):
    matches = i_mask.ravel() == np.arange(numCells)[:,None]
    return matches.dot(s_image.ravel())/pixel_counts

def vectorized_app2(s_image,i_mask,pixel_counts):
    matches = i_mask.ravel() == np.arange(numCells)[:,None]
    return np.einsum('ij,j->i',matches,s_image.ravel())/pixel_counts

def org_app(s_image,i_mask,pixel_counts):
    z = np.zeros(numCells)
    for i in range(numCells):
        z[i] = ((i_mask == i)*s_image).sum()/pixel_counts[i]
    return z
Timings:
In [7]: # Inputs
...: numCells = 100
...: m,n = 100,100
...: pixel_counts = np.random.rand(numCells)
...: s_image = np.random.rand(m,n)
...: i_mask = np.random.randint(0,numCells,(m,n))
...:
In [8]: %timeit org_app(s_image,i_mask,pixel_counts)
100 loops, best of 3: 8.13 ms per loop
In [9]: %timeit vectorized_app1(s_image,i_mask,pixel_counts)
100 loops, best of 3: 7.76 ms per loop
In [10]: %timeit vectorized_app2(s_image,i_mask,pixel_counts)
100 loops, best of 3: 4.08 ms per loop
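As an aside not covered in the answers above, np.bincount with a weights argument computes the same per-label sums without building the large matches matrix, and may be worth timing as well (a sketch under the same setup):
def bincount_app(s_image, i_mask, pixel_counts):
    # sum the pixel values falling under each mask index, then normalise by the counts
    sums = np.bincount(i_mask.ravel(), weights=s_image.ravel(), minlength=numCells)
    return sums / pixel_counts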
Here is my solution (with all three colours handled). Not sure how efficient this is. Anyone got a better solution?
import numpy as np
import pandas as pd
# Unravel the mask matrix into a 1-d array
i = np.ravel(i_mask)
# Unravel the image into 1-d arrays for
# each colour (RGB)
r = np.ravel(s_image[:,:,0])
g = np.ravel(s_image[:,:,1])
b = np.ravel(s_image[:,:,2])
# prepare a dictionary to create the dataframe
data = {'i' : i, 'r' : r, 'g' : g, 'b' : b}
# create a dataframe
df = pd.DataFrame(data)
# Use pandas pivot table to average the colour
# intensities for each cell index value
pixAvgs = pd.pivot_table(df, values=['r', 'g', 'b'], index='i')
pixAvgs.head()
Output:
b g r
i
-1 26.719482 68.041868 101.603297
0 75.432432 170.135135 202.486486
1 92.162162 184.189189 208.270270
2 71.179487 171.897436 201.846154
3 76.026316 178.078947 211.605263
In the end I solved this problem a different way and it drastically increased the speed. Instead of using i_mask as above (a 2-dimensional array of indices into the 1-d array of output intensities, z), I created a different array, mask1593, of dimensions (numCells x 45). Each row is a list of about 35 to 45 indices into the flattened 256x256 pixel image (0 to 65535).
In [10]: mask1593[0]
Out[10]:
array([14853, 14854, 15107, 15108, 15109, 15110, 15111, 15112, 15363,
15364, 15365, 15366, 15367, 15368, 15619, 15620, 15621, 15622,
15623, 15624, 15875, 15876, 15877, 15878, 15879, 15880, 16131,
16132, 16133, 16134, 16135, 16136, 16388, 16389, 16390, 16391,
16392, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)
Then I was able to achieve the same transformation as follows using numpy's advanced indexing:
def convert_image(self, image_array):
    """Convert a 256 x 256 RGB image array to 1593 RGB LED intensities."""
    global mask1593
    shape = image_array.shape
    img_data = image_array.reshape(shape[0]*shape[1], shape[2])
    return np.mean(img_data[mask1593], axis=1)
And here is the result: a 256x256 pixel colour image transformed into an array of 1593 colours for display on an irregular LED display.