Shortest Path between two matrices - python-2.7

I have two distance matrices with overlapping variable names.
dfA:
Start A1 A2 A3 A4 … A150
Location
A 12 4 12 2 9
B 5 2 19 4 3
C 1 4 8 7 12
dfB:
A B C
X 4 12 32
Y 1 6 12
Z 2 8,5 11
So from start A1, A2, etc. there are paths through A, B and C to X, Y and Z.
I would like to find the shortest path for an item, for example the combination A1 -> Z. I programmed this by loading the distance matrices from CSVs and unstacking them, then looping over the possible combinations with df.iterrows() and two for loops to find the smallest total for each combination such as A1 -> Z.
But since I have to do this for around 30,000 items, it takes far too long.
Anybody know how to do this in a vectorized way?

I added D so that the axis lengths differ (dfB won't be a square matrix); this is just for my convenience, and it works with square matrices too.
import pandas as pd
import numpy as np

# Comma is the decimal separator in the CSVs (e.g. "8,5").
df_a = pd.read_csv('dfA.csv', delim_whitespace=True, index_col=0, decimal=",")
df_b = pd.read_csv('dfB.csv', delim_whitespace=True, index_col=0, decimal=",")

mat_a = df_a.values  # (mid, start) distances
mat_b = df_b.values  # (end, mid) distances

# Expand both to shape (mid, start, end) so every start/mid/end
# combination can be summed in a single step.
mat_a2 = np.expand_dims(mat_a, axis=2)
mat_b2 = np.expand_dims(mat_b.T, axis=1)
mat_a3 = np.tile(mat_a2, (1, 1, mat_b.shape[0]))
mat_b3 = np.tile(mat_b2, (1, mat_a.shape[1], 1))

tot = mat_a3 + mat_b3            # total distance of each start -> mid -> end path
ind = np.argmin(tot, axis=0).T   # index of the best mid point per (end, start) pair

df_c = pd.DataFrame(df_b.columns.values[ind], columns=df_a.columns, index=df_b.index)
print(df_c)
dfA:
Start_Location A1 A2 A3 A4 A150
A 12 4 12 2 9
B 5 2 19 4 3
C 1 4 8 7 12
D 5 2 9 11 4
dfB:
A B C D
X 4 12 32 11,4
Y 1 6 2 9,3
Z 2 8,5 11 1,4
dfC:
A1 A2 A3 A4 A150
X A A A A A
Y C A C A B
Z D D D A D
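The two tile calls aren't strictly needed: NumPy broadcasting expands the singleton axes on its own, which also saves memory. A minimal sketch of the same computation, with made-up numbers standing in for the CSV files:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for dfA (mid -> start) and dfB (end -> mid).
df_a = pd.DataFrame([[12, 4], [5, 2], [1, 4]],
                    index=list("ABC"), columns=["A1", "A2"])
df_b = pd.DataFrame([[4, 12, 32], [1, 6, 12], [2, 8.5, 11]],
                    index=list("XYZ"), columns=list("ABC"))

# Broadcasting: (mid, start, 1) + (mid, 1, end) -> (mid, start, end).
tot = df_a.values[:, :, None] + df_b.values.T[:, None, :]

ind = tot.argmin(axis=0).T   # best mid point for each (end, start) pair
df_c = pd.DataFrame(df_b.columns.values[ind],
                    columns=df_a.columns, index=df_b.index)
print(df_c)
```

The result matches the tiled version, since tile only materializes the copies that broadcasting creates implicitly.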

Related

iterate over Dataframe row by index value and find max

I need to iterate over the df's rows based on its index, find the max in column p1 and fill it into the output dataframe (along with the max p1), and do the same for column p2. Within each range of my row indexes (sub_1_ica_1 -> sub_1_ica_n) there must be only one 1 and one 2, and the remaining entries should be zeros. That's why I need to do the operation range by range.
I tried to split the index name and make a counter for each subject to use when iterating over the rows, but I feel that I'm on the wrong track!
from collections import Counter

a = df.id.tolist()
indlist = []
for x in a:
    i = x.split('_')
    b = int(i[1])
    indlist.insert(-1, b)
c = Counter(indlist)
keyInd = c.keys()
Any ideas?
EDIT: following Jerazel's example, my desired output would look like this.
First I find the max for the p1 and p2 columns, which will be translated in the new df into 1 and 2; the remaining fields will be zeros.
I think you need numpy.argmax with max; if you also need the column names, use idxmax:
import pandas as pd

idx = ['sub_1_ICA_0','sub_1_ICA_1','sub_1_ICA_2','sub_2_ICA_0','sub_2_ICA_1','sub_2_ICA_2']
df = pd.DataFrame({'p0':[7,8,9,4,2,3],
                   'p1':[1,3,5,7,1,0],
                   'p2':[5,9,6,1,2,4]}, index=idx)
print (df)
cols = ['p0','p1','p2']
df['a'] = df[cols].values.argmax(axis=1)
df['b'] = df[cols].max(axis=1)
df['c'] = df[cols].idxmax(axis=1)
print (df)
p0 p1 p2 a b c
sub_1_ICA_0 7 1 5 0 7 p0
sub_1_ICA_1 8 3 9 2 9 p2
sub_1_ICA_2 9 5 6 0 9 p0
sub_2_ICA_0 4 7 1 1 7 p1
sub_2_ICA_1 2 1 2 0 2 p0
sub_2_ICA_2 3 0 4 2 4 p2
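The snippet above takes the max over all rows at once; if, as the question describes, each sub_n block needs its own single 1 (max of p1) and single 2 (max of p2) with zeros elsewhere, a groupby over the index prefix can mark them. A sketch, assuming the 'sub_<n>_ICA_<k>' naming scheme (and that the p1 and p2 maxima fall on different rows):

```python
import pandas as pd

idx = ['sub_1_ICA_0','sub_1_ICA_1','sub_1_ICA_2',
       'sub_2_ICA_0','sub_2_ICA_1','sub_2_ICA_2']
df = pd.DataFrame({'p1':[1,3,5,7,1,0],
                   'p2':[5,9,6,1,2,4]}, index=idx)

# Derive the subject key ('sub_1', 'sub_2') from the index labels.
subject = ['_'.join(name.split('_')[:2]) for name in df.index]

out = pd.Series(0, index=df.index)
for _, grp in df.groupby(subject):
    out[grp['p1'].idxmax()] = 1   # row holding this subject's max p1
    out[grp['p2'].idxmax()] = 2   # row holding this subject's max p2
print(out)
```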

how to normalize rating in scale of 1 to 5?

In the Yahoo! Movies dataset the rating scale runs from 1 to 13; here 1 represents a good rating and 13 the lowest rating for the movie.
If there is a 0, it means the user didn't rate that movie.
rating { 13 12 11 10 9 8 7 6 5 4 3 2 1 0 } OR
rating { A+ A A- B+ B B- C+ C C- D+ D D- F 0 }
eg. user m1 m2 m3
1 2 3 13
2 0 1 7
but I don't know how to normalize ratings on the 1-to-13 scale into a scale of 1 to 5.
The simplest thing I can do is:
{A+,A,A-} = 5
{B+,B,B-} = 4
{C+,C,C-} = 3
{D+,D,D-} = 2
{F} = 1
Is there any other method, or a formula I can use?
If floating-point values are allowed, simply multiply by 5/13. Round to whole numbers if necessary.
If 5 is the best, subtract the result from 6 (handle 0 with an if clause).
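As a sketch of that idea: an exact linear map from the 1-13 scale (1 = best) onto 1-5 (5 = best) keeps the endpoints in range, which plain multiplication followed by subtraction can miss at the edges. The function name is mine:

```python
def rescale(r):
    """Map a 1-13 rating (1 = best) to 1-5 (5 = best); 0 stays 0 (unrated)."""
    if r == 0:
        return 0
    # Linear map: 1 -> 5, 13 -> 1, intermediate values rounded.
    return int(round(5 - (r - 1) * 4 / 12.0))

print([rescale(r) for r in (0, 1, 7, 13)])
```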

Python - multiply range of column by last column

I have a data frame in the following form
name v1 v2 v3
x 1 4 7
y 2 5 8
z 3 6 9
I want to multiply each value in the middle two columns by the value in the final column, output would be:
name v1 v2 v3
x 7 28 7
y 16 40 8
z 27 54 9
My current attempt gives the error "'Index' object has no attribute 'apply'":
df[df.columns[1:-2]] = df.columns[1:-2].apply(lambda x : (x*df.columns[-1]))
You can use iloc for selecting by position with mul:
print (df.iloc[:, 1:-1])
v1 v2
0 1 4
1 2 5
2 3 6
df.iloc[:, 1:-1] = df.iloc[:, 1:-1].mul(df.iloc[:, -1], axis=0)
print (df)
name v1 v2 v3
0 x 7 28 7
1 y 16 40 8
2 z 27 54 9
Solution with selecting columns by names:
df[['v1','v2']] = df[['v1','v2']].mul(df['v3'], axis=0)
print (df)
name v1 v2 v3
0 x 7 28 7
1 y 16 40 8
2 z 27 54 9

SAS Function to calculate percentage for row for two stratifications

I have a dataset that looks like this
data test;
input id1$ id2$ score1 score2 score3 total;
datalines;
A D 9 36 6 51
A D 9 8 6 23
A E 5 3 2 10
B D 5 3 3 11
B E 7 4 7 18
B E 5 3 3 11
C D 8 7 9 24
C E 8 52 6 66
C D 4 5 3 12
;
run;
I want to add a column that calculates what percentage each row's total is of the sum within its id1 and id2 group.
What I mean is this: id1 has a value of A. Within A there are two id2 values, D and E. There are two rows of D and one of E. The two total values for D are 51 and 23, summing to 74; the one total value for E is 10, summing to 10. The column I'd like to create would hold the values .69 (51/74), .31 (23/74), and 1 (10/10) in rows 1, 2, and 3 respectively.
I need to perform this calculation for the rest of the id1 values and their corresponding id2. When complete, I want a table that looks like this:
id1 id2 score1 score2 score3 total percent_of_total
A D 9 36 6 51 0.689189189
A D 9 8 6 23 0.310810811
A E 5 3 2 10 1
B D 5 3 3 11 1
B E 7 4 7 18 0.620689655
B E 5 3 3 11 0.379310345
C D 8 7 9 24 0.666666667
C E 8 52 6 66 1
C D 4 5 3 12 0.333333333
I realize a loop might solve the example I've given, but I'm dealing with EIGHT levels of stratification, with as many as 98 sublevels within those levels, so a loop is not practical. I'm thinking something along the lines of PROC SUMMARY, but I'm not too familiar with that procedure.
Thank you.
It is easy to do with a data step; just make sure the records are sorted.
You can find the grand total for each ID1*ID2 combination in a first pass, then use it to calculate the percentage in a second pass.
proc sort data=test;
by id1 id2;
run;
data want ;
  /* First pass: accumulate the group total. */
  do until (last.id2);
    set test ;
    by id1 id2 ;
    grand = sum(grand,total);
  end;
  /* Second pass: re-read the same group and output the percentage. */
  do until (last.id2);
    set test ;
    by id1 id2 ;
    percent_of_total = total/grand ;
    output;
  end;
run;
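For comparison only (not part of the SAS answer): the same within-group percentage is a one-liner in pandas with groupby/transform, shown here on the same data:

```python
import pandas as pd

df = pd.DataFrame({'id1': list('AAABBBCCC'),
                   'id2': list('DDEDEEDED'),
                   'total': [51, 23, 10, 11, 18, 11, 24, 66, 12]})

# Divide each total by the sum of totals within its (id1, id2) group.
df['percent_of_total'] = (
    df['total'] / df.groupby(['id1', 'id2'])['total'].transform('sum')
)
print(df)
```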

How to group data in kdb+ using customized groups?

I have a table (allsales) with a time column (sale_time). I want to group the data by sale_time, but bucketed: e.g. any rows where the time is between 00:00:00-03:00:00 should be grouped together, 03:00:00-06:00:00 together, and so on. Is there a way to write such a query?
xbar is useful for rounding values down to interval boundaries, e.g.
q)5 xbar 1 3 5 8 10 11 12 14 18
0 0 5 5 10 10 10 10 15
We can then use this to group rows into time groups, for your example:
q)s:([] t:13:00t+00:15t*til 24; v:til 24)
q)s
t v
--------------
13:00:00.000 0
13:15:00.000 1
13:30:00.000 2
13:45:00.000 3
14:00:00.000 4
14:15:00.000 5
..
q)select count i,sum v by xbar[`int$03:00t;t] from s
t | x v
------------| ------
12:00:00.000| 8 28
15:00:00.000| 12 162
18:00:00.000| 4 86
"by xbar[`int$03:00t;t]" floors the time column t to the start of its three-hour bucket, and this is then used as the group-by key.
There are a few more ways to achieve the same result.
q)select count i , sum v by t:01:00u*3 xbar t.hh from s
q)select count i , sum v by t:180 xbar t.minute from s
t | x v
-----| ------
12:00| 8 28
15:00| 12 162
18:00| 4 86
But in all cases, be careful if a date column is present in the table: group by the date as well, otherwise the same time window across different dates will be merged into one bucket and give wrong results.
q)s:([] d:24#2013.05.07 2013.05.08; t:13:00t+00:15t*til 24; v:til 24)
q)select count i , sum v by d, t:180 xbar t.minute from s
d t | x v
----------------| ----
2013.05.07 12:00| 4 12
2013.05.07 15:00| 6 78
2013.05.07 18:00| 2 42
2013.05.08 12:00| 4 16
2013.05.08 15:00| 6 84
2013.05.08 18:00| 2 44
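For readers coming from pandas: the same three-hour bucketing can be sketched with dt.floor, which plays the role of xbar here. The data mirrors the q example above:

```python
import pandas as pd

# 24 quarter-hour timestamps starting at 13:00, values 0..23.
s = pd.DataFrame({'t': pd.date_range('2013-05-07 13:00', periods=24, freq='15min'),
                  'v': range(24)})

# Floor each timestamp to its 3-hour bucket, then aggregate per bucket.
s['bucket'] = s['t'].dt.floor('3h')
out = s.groupby('bucket')['v'].agg(['count', 'sum'])
print(out)
```

The counts and sums (8/28, 12/162, 4/86) match the q output for the 12:00, 15:00 and 18:00 buckets.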