pyspark mathematical computation in a dataframe - python-2.7

I have extracted a DataFrame from a larger DataFrame, and now I need to do simple computations like addition and division on it.
A sample of the DataFrame:
item    counts
z       23156
x       15462
What I need to do is divide x by the sum of x and z, for example:
value = x / (x + z)

You must compute the sums first, then divide x by sum(x) + sum(z)
for example:
Table 1 (original table):
x    z
1    2
3    4
Table 2 (Aggregated table):
table2 = sqlCtx.sql("select sum(x) + sum(z) as sum_xz from table1")
table2.registerTempTable("table2")
sum_xz
10
Then join both tables and divide:
table3 = sqlCtx.sql("select a.x / b.sum_xz from table1 a join table2 b")
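If you prefer the DataFrame API over SQL, here is a minimal PySpark sketch of the same idea (aggregate first, then divide), assuming df is the extracted DataFrame with the item and counts columns from the sample above:
from pyspark.sql import functions as F
# Total counts per item; the aggregated result is tiny, so collect it to the driver.
rows = df.groupBy("item").agg(F.sum("counts").alias("total")).collect()
totals = dict((r["item"], r["total"]) for r in rows)
# value = x / (x + z)
value = totals["x"] / float(totals["x"] + totals["z"])
print(value)  # 15462 / (15462 + 23156)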

Redshift generate rows as many as value in another column

df
customer_code    contract_code    product    num_products
C0134            AB01245          toy_1      4
B8328            EF28421          doll_4     2
I would like to transform this table based on the integer value in column num_products and generate a unique id for each row:
Expected_df
unique_id    customer_code    contract_code    product    num_products
A1           C0134            AB01245          toy_1      1
A2           C0134            AB01245          toy_1      1
A3           C0134            AB01245          toy_1      1
A4           C0134            AB01245          toy_1      1
A5           B8328            EF28421          doll_4     1
A6           B8328            EF28421          doll_4     1
unique_id can be any random characters as long as I can use a count(distinct) on it later on.
I read that generate_series(1,10000) i is available in later versions of Postgres but not in Redshift
You need to use a recursive CTE to generate the series of numbers. Then join this with your data to produce the extra rows. I used row_number() to get the unique_id in the example below.
This should meet your needs or at least give you a start:
create table df (
  customer_code varchar(16),
  contract_code varchar(16),
  product varchar(16),
  num_products int);
insert into df values
  ('C0134', 'AB01245', 'toy_1', 4),
  ('B8328', 'EF28421', 'doll_4', 2);
-- recursive CTE generating the series 1 .. max(num_products)
with recursive nums (n) as
( select 1 as n
  union all
  select n + 1 as n
  from nums
  where n < (select max(num_products) from df) )
-- each row joins to nums 1..num_products, fanning it out num_products times
select row_number() over () as unique_id, customer_code, contract_code, product, num_products
from df d
left join nums n
  on d.num_products >= n.n;
SQLfiddle at http://sqlfiddle.com/#!15/d829b/12
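For comparison only, the same fan-out in pandas is a one-liner with index.repeat (a small sketch, not Redshift; data hard-coded from the sample above):
import pandas as pd
df = pd.DataFrame({'customer_code': ['C0134', 'B8328'],
                   'contract_code': ['AB01245', 'EF28421'],
                   'product': ['toy_1', 'doll_4'],
                   'num_products': [4, 2]})
# Repeat each row num_products times, then assign sequential unique ids.
expanded = df.loc[df.index.repeat(df['num_products'])].reset_index(drop=True)
expanded['num_products'] = 1
expanded.insert(0, 'unique_id', ['A%d' % (i + 1) for i in expanded.index])
print(expanded)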

Power BI - Matching closest 3D points from two tables

I have two tables (Table 1 and Table 2), both containing thousands of three-dimensional point coordinates (X, Y, Z); Table 2 also has an attribute column.
Table 1
X       Y        Z
6007    44268    1053
6020    44269    1051
Table 2
X       Y        Z       Attribute
6011    44310    1031    A
6049    44271    1112    B
I need to populate a calculated column in Table 1 with an attribute from Table 2 based on the minimum distance between points in 3D space. Basically, match the points in Table 1 to the closest point in Table 2 and then fetch the attribute from Table 2.
So far I have tried rounding X, Y and Z in both tables, then concatenating the rounded values into a separate column in each table. I then use DAX:
CALCULATE(FIRSTNONBLANK(Table2[Attribute], 1), FILTER(ALL(Table2), Table2[XYZ] = Table1[XYZ]))
This has given me reasonable success, depending on the degree of rounding applied to the coordinates.
Is there a better way to achieve this in Power BI?
This is similar to this post, except with a simpler distance function. See also this post.
Assuming you want the standard Euclidean Distance:
ClosestPointAttribute =
MINX (
TOPN (
1,
Table2,
( Table2[X] - Table1[X] ) ^ 2 +
( Table2[Y] - Table1[Y] ) ^ 2 +
( Table2[Z] - Table1[Z] ) ^ 2,
ASC
),
Table2[Attribute]
)
Note: I've omitted the SQRT from the formula because we don't need the actual distance, just the ordering (and SQRT preserves order since it's a strictly increasing function). You can include it if you prefer.
A function in M Code:
(p1 as list, q1 as list)=>
let
f = List.Generate(
()=> [x = Number.Power(p1{0}-q1{0},2), idx=0],
each [idx]<List.Count(p1),
each [x = Number.Power(p1{[idx]+1}-q1{[idx]+1},2), idx=[idx]+1],
each [x]
),
r = Number.Sqrt(List.Sum(f))
in
r
Each list is a set of coordinates, and the function will return the distance between p1 and q1.
The above function (which I named fnDistance) can be incorporated into Power Query code as in this example:
let
//Read in both tables and set data types
Source2 =Excel.CurrentWorkbook(){[Name="Table_2"]}[Content],
table2 = Table.TransformColumnTypes(Source2,{{"X", Int64.Type}, {"Y", Int64.Type}, {"Z", Int64.Type},{"Attribute", Text.Type}}),
Source = Excel.CurrentWorkbook(){[Name="Table_1"]}[Content],
table1 = Table.TransformColumnTypes(Source,{{"X", Int64.Type}, {"Y", Int64.Type}, {"Z", Int64.Type}}),
//calculate distances from Table 1 coordinates to each of the Table 2 coordinates and store in a List
custom = Table.AddColumn(table1,"Distances", each
let
t2 = Table.ToRecords(table2),
X=[X],
Y=[Y],
Z=[Z],
distances = List.Generate(()=>
[d=fnDistance({X,Y,Z},{t2{0}[X],t2{0}[Y],t2{0}[Z]}),a=t2{0}[Attribute], idx=0],
each [idx] < List.Count(t2),
each [d=fnDistance({X,Y,Z},{t2{[idx]+1}[X],t2{[idx]+1}[Y],t2{[idx]+1}[Z]}),a=t2{[idx]+1}[Attribute], idx=[idx]+1],
each {[d],[a]}),
//determine the set of coordinates with the minimum distance and return the associated Attribute
minDistance = List.Min(List.Alternate(List.Combine(distances),1,1,1)),
attribute = List.Range(List.Combine(distances), List.PositionOf(List.Combine(distances),minDistance)+1,1){0}
in
attribute, Text.Type)
in
custom
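As a sanity check outside Power BI, here is a small Python sketch of the same brute-force nearest-neighbour matching (squared Euclidean distance, with the sample coordinates hard-coded):
# Points from Table 1, and (point, attribute) pairs from Table 2.
table1 = [(6007, 44268, 1053), (6020, 44269, 1051)]
table2 = [((6011, 44310, 1031), 'A'), ((6049, 44271, 1112), 'B')]
def closest_attribute(p):
    # Squared distance is enough for ordering, as noted above.
    return min(table2, key=lambda qa: sum((a - b) ** 2 for a, b in zip(p, qa[0])))[1]
for point in table1:
    print(point, closest_attribute(point))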

How to calculate performance curve for each row of data

I want to plot a performance curve for each row of data I have.
A simple version of what I want to do: plot the function Y = m*X + b, where I have a table of m and b values and I want Y values for X = 1 to 10.
How is this calculated?
A Y = mX + b example can be seen in the following plot: [plot image not included]
The following works:
WITH NUMBERS AS
(
SELECT N FROM (VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10))N(N)
),
Examples AS
(
SELECT m,b FROM (VALUES (1,2),(2,2))N(m,b)
)
SELECT
'Y = ' + CAST(Examples.m as varchar(10)) + 'X + ' + CAST(Examples.b as varchar(10)) AS Formula
,Numbers.N AS X
, Numbers.N * Examples.m + Examples.b AS Y
FROM Examples
CROSS JOIN NUMBERS
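The same curves are easy to generate in Python as well; a minimal matplotlib sketch using the (m, b) pairs from the SQL example above:
import numpy as np
import matplotlib.pyplot as plt
params = [(1, 2), (2, 2)]  # (m, b) pairs, one curve per row of the table
x = np.arange(1, 11)       # X = 1 to 10
for m, b in params:
    plt.plot(x, m * x + b, label='Y = %dX + %d' % (m, b))
plt.xlabel('X')
plt.ylabel('Y')
plt.legend()
plt.show()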

Deleting duplicate x values and their corresponding y values

I am working with a list of points in Python 2.7 and running some interpolations on the data. My list has over 5000 points, and I have some repeating "x" values within it. These repeating "x" values have different corresponding "y" values. I want to get rid of these repeating points so that my interpolation function will work; if there are repeating "x" values with different "y" values, it raises an error because such data does not satisfy the definition of a function. Here is a simple example of what I am trying to do:
Input:
x = [1,1,3,4,5]
y = [10,20,30,40,50]
Output:
xy = [(1,10),(3,30),(4,40),(5,50)]
The interpolation function I am using is InterpolatedUnivariateSpline(x, y)
Have a variable where you store the previous X value; if it is the same as the current value, skip the current value.
For example (pseudo code, you do the python),
int previousX = -1
foreach X
{
    if (x == previousX)
    { /* skip */ }
    else
    {
        InterpolatedUnivariateSpline(x, y)
        previousX = x /* store the x value that will be "previous" in the next iteration */
    }
}
I am assuming you are already iterating, so you don't need the actual Python code.
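Still, for completeness, a minimal Python sketch of that skip-duplicates idea (collect the filtered points first, then fit the spline once; this assumes x is already sorted, as in the example):
from scipy.interpolate import InterpolatedUnivariateSpline
x = [1, 1, 3, 4, 5]
y = [10, 20, 30, 40, 50]
xs, ys = [], []
previous_x = None
for xi, yi in zip(x, y):
    if xi == previous_x:
        continue  # skip a repeated x value, keeping the first y
    xs.append(xi)
    ys.append(yi)
    previous_x = xi
spline = InterpolatedUnivariateSpline(xs, ys)  # xs is now strictly increasing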
A bit late but if anyone is interested, here's a solution with numpy and pandas:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
x = [1,1,3,4,5]
y = [10,20,30,40,50]
#convert list into numpy arrays:
array_x, array_y = np.array(x), np.array(y)
# sort x and y by x value
order = np.argsort(array_x)
xsort, ysort = array_x[order], array_y[order]
#create a dataframe and add 2 columns for your x and y data:
df = pd.DataFrame()
df['xsort'] = xsort
df['ysort'] = ysort
#create new dataframe (mean) with no duplicate x values and corresponding mean values in all other cols:
mean = df.groupby('xsort').mean()
df_x = mean.index
df_y = mean['ysort']
# poly1d to create a polynomial line from coefficient inputs:
trend = np.polyfit(df_x, df_y, 14)
trendpoly = np.poly1d(trend)
# plot polyfit line ("colour" and "[name of figure]" are placeholders for your own values):
plt.plot(df_x, trendpoly(df_x), linestyle=':', dashes=(6, 5), linewidth='0.8',
color=colour, zorder=9, figure=[name of figure])
Also, if you just use argsort() on the values in order of x, the interpolation should work even without having to delete the duplicate x values. Trying this on my own dataset:
polyfit on its own
sorting data in order of x first, then polyfit
sorting data, deleting duplicates, then polyfit
... the last two give the same result.

Fetching top n records in pandas pivot , based on multiple criteria and plotting them with matplotlib

Use case: extending the pivot functionality of pandas. Fetch the top n records and plot each name's own "click %" against the number of records for that name.
import pandas as pd
import numpy as np
df1 = pd.DataFrame({'name':['A', 'A', 'B', 'B','C','A'], 'click':[1,1,0,1,1,0]})
click name
0 1 A
1 1 A
2 0 B
3 1 B
4 1 C
5 0 A
[6 rows x 2 columns]
#fraction of records present & clicks as a fraction of it's OWN records present
f=df1.pivot_table(rows='name', aggfunc=[len, np.sum])
f['len']['click']/sum(f['len']['click']) , f['sum']['click']/sum(f['sum']['click'])
(name
A 0.500000
B 0.333333
C 0.166667
Name: click, dtype: float64, name
A 0.50
B 0.25
C 0.25
Name: click, dtype: float64)
But to be able to plot them, I need to store the top n records in an object that matplotlib supports.
I tried storing the "top names" (A, B, C, etc.) by creating a dict from the output of f['len']['click']/sum(f['len']['click']), sorted by values, after which I stored the "click %" [A -> 0.50, B -> 0.25, C -> 0.25] in the same dictionary.
Since this is clearly overkill, I am wondering if there's a more pythonic way to do this?
I also tried head with a groupby clause, but it doesn't give me what I am looking for. I am looking for a dataframe as above:
A 0.500000
B 0.333333
C 0.166667
Name: click, dtype: float64, name
A 0.50
B 0.25
C 0.25
except that the top n logic should be embedded (head(n) does not work since n depends on my data set; I guess I need to use "apply"?), and after this the object, which is a Series, needs to be identified by matplotlib with its own labels (the top n "name" values here)
Here's my dict function implementation (clearly an overkill, just to fetch the top n by a custom criterion as above):
def freq_counts(df_var, n):  # df_var is like df1.name, to keep the top n logic generic for each column name
    perct_freq = dict((df_var.value_counts() * 100) / len(df_var))
    vec = []
    for key, value in perct_freq.items():
        if value >= n:
            vec.append([key, value])
    return vec

freq_counts(df1.name, 3)  # e.g. top 3 freq counts - to get the names, see vec[i][0] which has the corresponding keys
# When I calculate perct_freq, which is a Series object, I would ideally want to avoid converting it to a dict - what an overkill!
What I want to do:
1. Store the actual occurrences (len of names) and find the fraction of each "name" in the population.
2. Against this, also find the "success outcome" (clicks) and express it as a fraction of its OWN population.
3. Finally, plot the top n name(s), i.e. the output of (1) & (2), in the same plot; the criterion for top n should be based on (1) as a percentage.
I.e. for (1) & (2) use dataframes that support plot with:
name as labels on the x axis
(1) as the y axis (primary)
(2) as the y axis (secondary)
PPS: In the code above,
(1) is f['len']['click']/sum(f['len']['click']) and
(2) is f['sum']['click']/sum(f['sum']['click'])
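For what it's worth, a more direct sketch of (1), (2) and the top-n plot in newer pandas (where index= replaces the old rows= keyword of pivot_table; n below is a placeholder cutoff):
import pandas as pd
import matplotlib.pyplot as plt
df1 = pd.DataFrame({'name': ['A', 'A', 'B', 'B', 'C', 'A'], 'click': [1, 1, 0, 1, 1, 0]})
record_frac = df1['name'].value_counts(normalize=True)                 # (1)
click_frac = df1.groupby('name')['click'].sum() / df1['click'].sum()   # (2)
n = 2  # placeholder top-n cutoff
top = record_frac.nlargest(n)  # top n names by criterion (1)
summary = pd.DataFrame({'record_frac': top, 'click_frac': click_frac[top.index]})
ax = summary['record_frac'].plot(kind='bar')                     # primary y axis
summary['click_frac'].plot(ax=ax, secondary_y=True, style='o-')  # secondary y axis
plt.show()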