How to use the dimension of a python matrix in a loop - python-2.7

I am working with a matrix, let's call it X, in Python.
I know how to get the dimensions of the matrix using X.shape, but I specifically want to use the number of rows of the matrix in a for loop, and I don't know how to get this value in a datatype suitable for a loop.
For example, imagine this simple situation:
a = np.matrix([[1,2,3],[4,5,6]])
for i in 1:(number of rows of a)
    print i
How can I get automatically that "number of rows of a"?

X.shape[0] == number of rows in X

A superficial search on numpy will lead you to shape. It returns a tuple of array dimensions.
In your case, the first dimension (axis) is the rows and the second is the columns. You can access each one as you access a tuple's element:
import numpy as np
a = np.matrix([[1,2,3],[4,5,6]])
# a.shape[1]: columns
for i in range(0, a.shape[1]):
    print 'column ' + format(i)
# a.shape[0]: rows
for i in range(0, a.shape[0]):
    print 'row ' + format(i)
This will print:
column 0
column 1
column 2
row 0
row 1
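Since shape is a plain Python tuple, you can also unpack it into named variables first, which some find more readable; a minimal sketch of the same idea:
import numpy as np
a = np.matrix([[1,2,3],[4,5,6]])
rows, cols = a.shape  # unpack the (2, 3) tuple
for i in range(rows):
    print 'row ' + format(i)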

Related

How to shuffle rows of 3 dimensional numpy array?

I have a 3 dimensional array with 3 rows (samples), 2 columns and 4 features. I want to shuffle the three samples, but the following command gives an error saying that only size-1 arrays can be converted to Python scalars. How can I do that?
x = np.arange(3*2*4).reshape(3,2,4)
perm = np.arange(x[0])
As the other answer has mentioned, you need to use np.random.shuffle, and according to the documentation it doesn't touch anything other than the first axis (your samples).
This function only shuffles the array along the first axis of a multi-dimensional array.
Let's take the following example:
x = np.arange(2 * 3).reshape([3, 2])
perm = np.arange(x.shape[0])
You can also print x.shape and see the resulting tuple (3, 2); its zeroth element is 3.
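To complete the picture, here is a minimal sketch of the shuffle itself on the asker's original 3-D array; the resulting order of the samples depends on the random seed:
import numpy as np
x = np.arange(3 * 2 * 4).reshape(3, 2, 4)
np.random.shuffle(x)  # in-place shuffle along the first axis only
# each 2x4 sample block stays intact; only the order of the 3 samples changes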

Join strings from the same column in pandas using a placeholder condition

I have a series of data that I need to filter.
The df consists of one column of information that is separated by rows with the value NaN.
I would like to join, in a new column, all of the rows that occur before each NaN.
For example my data looks something like:
the
car
is
red
NaN
the
house
is
big
NaN
the
room
is
small
My desired result is
B
the car is red
the house is big
the room is small
Thus far, I am approaching this problem by building a function and applying it to each row in my dataframe. See below for my working code example so far.
def joinNan(row):
    newRow = []
    placeholder = 'NaN'
    if row is not placeholder:
        newRow.append(row)
    if row == placeholder:
        return newRow
df['B'] = df.loc[0].apply(joinNan)
For some reason, the first row of my data is being used as the index or column title, which is why I am using loc[0] here instead of a specific column name.
If there is a more straightforward way to approach this by iterating directly over the column, I am open to that suggestion too.
For now, I am trying to reach my desired solution and have not found any similar case on Stack Overflow or the web in general to help me.
I think you need isna to test for the NaNs, then create a helper Series with cumsum and aggregate with join per groupby group:
df = df.groupby(df[0].isna().cumsum())[0].apply(lambda x: ' '.join(x.dropna())).to_frame('B')
# for older versions of pandas
df = df.groupby(df[0].isnull().cumsum())[0].apply(lambda x: ' '.join(x.dropna())).to_frame('B')
Another solution is to filter out all NaNs before the groupby:
mask = df[0].isna()
#mask = df[0].isnull()
df['g'] = mask.cumsum()
df = df[~mask].groupby('g')[0].apply(' '.join).to_frame('B')
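A quick check of the second variant with the sample data from the question (assuming, as both snippets above do, that the words sit in column 0 of the frame):
import numpy as np
import pandas as pd
df = pd.DataFrame({0: ['the', 'car', 'is', 'red', np.nan,
                       'the', 'house', 'is', 'big', np.nan,
                       'the', 'room', 'is', 'small']})
mask = df[0].isnull()
df['g'] = mask.cumsum()
df = df[~mask].groupby('g')[0].apply(' '.join).to_frame('B')
# B now holds: 'the car is red', 'the house is big', 'the room is small'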

Data input for K means clustering with Scipy, Python?

I have a point dataset with two attributes and I would like to cluster these points based on the attribute values. I want to use K-means clustering but I am unsure what my input data should look like when using Scipy's implementation.
For example should I make a numpy array with each row containing: FID, attribute 1, attribute 2, x-coord, y-coord, or an array of just the attribute values? The attributes are integers and floats.
Each row in your data should be a discrete observation and each column should correspond to a feature or dimension of your data. In your case, FID, attribute 1, attribute 2, x-coord and y-coord should be the columns, and each row should represent one observation.
from scipy.cluster.vq import kmeans, vq
nbStates = 4
Centers, _ = kmeans(Data, nbStates)
Data_id, _ = vq(Data, Centers)
where Data should be an Nx5 matrix whose 5 columns correspond to your 5 features (FID, attribute 1, attribute 2, x-coord, y-coord) and whose N rows correspond to your N observations. In other words, reshape each feature array (FID and the rest) into a column vector, horizontally concatenate them, and pass the result to the kmeans function. nbStates is the number of clusters you expect to see; it has to be set beforehand. The result Centers is a kxM matrix, where k is the number of clusters and M is the number of features in your data. Data_id is a column vector holding the cluster label of each data point; it is Nx1, where N is the number of data points.
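A minimal sketch of that concatenation step, using hypothetical random per-feature arrays as stand-ins for the real data:
import numpy as np
from scipy.cluster.vq import kmeans, vq
N = 100
# hypothetical per-feature arrays, each of shape (N,)
fid = np.arange(N)
attr1 = np.random.rand(N)
attr2 = np.random.rand(N)
xcoord = np.random.rand(N)
ycoord = np.random.rand(N)
# horizontally concatenate the column vectors -> shape (N, 5)
Data = np.column_stack((fid, attr1, attr2, xcoord, ycoord)).astype(float)
Centers, _ = kmeans(Data, 4)   # 4 clusters, like nbStates above
Data_id, _ = vq(Data, Centers)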
If you want to cluster solely on the attributes you should create an Nx2 matrix (according to the scipy docs), with your two attributes as columns and each data point as a row.
You will probably improve your results by whitening (normalizing) the data points. Assuming your data have two fields attr1 and attr2 and you have a list dataset containing them, the corresponding code would look like:
import numpy as np
from scipy.cluster.vq import kmeans, vq, whiten
data = np.ndarray((len(dataset), 2))
for row, d in enumerate(dataset):
    data[row, 0] = d.attr1
    data[row, 1] = d.attr2
whitened_data = whiten(data)
clusters, _ = kmeans(whitened_data, 5)  # 5 is the number of clusters you assume
assignments, _ = vq(whitened_data, clusters)

converting python pandas column to numpy array in place

I have a csv file in which one of the columns is a semicolon-delimited list of floating point numbers of variable length. For example:
Index List
0 900.0;300.0;899.2
1 123.4;887.3;900.1;985.3
When I read this into a pandas DataFrame, the datatype for that column is object. I want to convert it, ideally in place, to a numpy array (or just a regular float array; it doesn't matter too much at this stage).
I wrote a little function which takes a single one of those list elements and converts it to a numpy array:
def parse_list(data):
    data_list = data.split(';')
    return np.array(map(float, data_list))
This works fine, but what I want to do is do this conversion directly in the DataFrame so that I can use pandasql and the like to manipulate the whole data set after the conversion. Can someone point me in the right direction?
EDIT: I seem to have asked the question poorly. I would like to convert the following data frame:
Index List
0 900.0;300.0;899.2
1 123.4;887.3;900.1;985.3
where the dtype of List is 'object'
to the following dataframe:
Index List
0 [900.0, 300.0, 899.2]
1 [123.4, 887.3, 900.1, 985.3]
where the datatype of List is numpy array of floats
EDIT2: some progress, thanks to the first answer. I now have the line:
df['List'] = df['List'].str.split(';')
which splits the column in place into lists, but the dtype remains object. When I then try to do
df['List'] = df['List'].astype(float)
I get the error:
return arr.astype(dtype)
ValueError: setting an array element with a sequence.
If I understand you correctly, you want to transform your data from pandas to numpy arrays.
I used this:
pandas_DataName.as_matrix(columns=None)
and it worked for me.
See the pandas as_matrix documentation for more information.
I hope this helps.
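That converts the whole frame rather than the single column from the question, though. A minimal sketch of the per-cell conversion the asker is after, reusing the parse_list idea (assuming the column is named List, as in the example):
import numpy as np
import pandas as pd
df = pd.DataFrame({'List': ['900.0;300.0;899.2', '123.4;887.3;900.1;985.3']})
df['List'] = df['List'].apply(lambda s: np.array([float(v) for v in s.split(';')]))
# the column dtype stays object (the rows have different lengths),
# but each cell is now a numpy array of floats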

Based on a count value I have to create a number of rows, is that possible without a Java transformation?

Hey guys, does anyone know how to create a number of rows based on a count value without using a Java transformation in Informatica 9.6 (for a flat file)? Please help me with that.
You can create an auxiliary table in which each possible count value n between 1 and N appears on n rows:
1
2
2
3
3
3
...
N
...
N   (the value N appears on N rows)
Join this table to the source data using the count value n as the key and you will get n copies of each source row.
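Informatica aside, the mechanics of that join are easy to demonstrate in pandas with hypothetical sample data, just to show why it multiplies the rows:
import pandas as pd
# hypothetical source rows, each carrying a count value
src = pd.DataFrame({'id': ['a', 'b'], 'cnt': [2, 3]})
# auxiliary table: the value n appears on n rows, here up to N = 3
aux = pd.DataFrame({'cnt': [1, 2, 2, 3, 3, 3]})
out = src.merge(aux, on='cnt')
# 'a' now appears twice and 'b' three times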