Apply a function to each row depending on a condition

I have this data frame:
x1 x2 x3
1 2.5 2.8 1.4
2 2.1 1.9 2.3
3 1.7 2.2 4.4
4 2.4 3.8 3.7
5 4.3 4.4 4.1
6 4.2 4.9 2.4
7 2.7 1.5 2.5
8 2.8 3.3 4.9
9 3.5 2.3 2.9
10 4.1 2.8 2.2
I need to check a condition for every row and apply a function to that row, storing the result in a fourth column or in a separate vector. That is: if min_value_of_row < threshold then min(row) else mean(row).
How would one do that?

A bit late, but I was looking for something similar. First I would create two columns with the min and mean values of each row. Both are computed from the original columns only, so that the helper 'min' column does not leak into the mean:
cols = ['x1', 'x2', 'x3']
df['min'] = df[cols].min(axis=1)
and
df['mean'] = df[cols].mean(axis=1)
then build a function:
def f(x):
    thr = 2  # threshold
    # below the threshold take the row minimum, otherwise the row mean
    if x['min'] < thr:
        return x['min']
    return x['mean']
and apply it to the dataframe row-wise (axis=1):
df['value'] = df.apply(f, axis=1)
this returns (helper columns omitted, values rounded to three decimals):
x1 x2 x3 value
1 2.5 2.8 1.4 1.400
2 2.1 1.9 2.3 1.900
3 1.7 2.2 4.4 1.700
4 2.4 3.8 3.7 3.300
5 4.3 4.4 4.1 4.267
6 4.2 4.9 2.4 3.833
7 2.7 1.5 2.5 1.500
8 2.8 3.3 4.9 3.667
9 3.5 2.3 2.9 2.900
10 4.1 2.8 2.2 3.033
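For larger frames, the same rule can also be written without a row-wise apply. The following is a minimal sketch, assuming the column names x1, x2, x3 and the threshold of 2 from above. numpy.where is swapped in for the apply-based function, the names `thr`, `row_min` and `row_mean` are only illustrative, and min and mean are taken over the three original columns only:

```python
import numpy as np
import pandas as pd

# Sample data from the question, indexed 1..10 as shown there.
df = pd.DataFrame(
    {
        "x1": [2.5, 2.1, 1.7, 2.4, 4.3, 4.2, 2.7, 2.8, 3.5, 4.1],
        "x2": [2.8, 1.9, 2.2, 3.8, 4.4, 4.9, 1.5, 3.3, 2.3, 2.8],
        "x3": [1.4, 2.3, 4.4, 3.7, 4.1, 2.4, 2.5, 4.9, 2.9, 2.2],
    },
    index=range(1, 11),
)

thr = 2  # assumed threshold, as in the answer above
cols = ["x1", "x2", "x3"]

# Row-wise statistics over the original columns only.
row_min = df[cols].min(axis=1)
row_mean = df[cols].mean(axis=1)

# Pick the minimum where it is below the threshold, otherwise the mean.
df["value"] = np.where(row_min < thr, row_min, row_mean)
```

Because no helper columns are written to the frame, `df` ends up with just x1, x2, x3 and value.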


Define a new variable name based on a condition within the R dplyr universe (mutate, if, ifelse)

I want to add a new variable in a dplyr workflow and define the variable name based on a condition. There is a lot of discussion out there about conditional mutating with ifelse() to define the values of a given variable, but not about how to conditionally define its name.
Something like:
Test <- 'A'
Test_results <- c(1.1, 33, 343, 2.22, 2.4)
##
iris<- iris%>%
dplyr::mutate(
ifelse(Test=='A',
Test_A=Test_results,
ifelse(Test=='B',
Test_B=Test_results,
no_Test='no_results')) )
Desired output (given that Test <- 'A') is:
> iris
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Test_A
1 5.1 3.5 1.4 0.2 setosa 1.1
2 4.9 3.0 1.4 0.2 setosa 33
3 4.7 3.2 1.3 0.2 setosa 343
4 4.6 3.1 1.5 0.2 setosa 2.22
5 5.0 3.6 1.4 0.2 setosa 2.4
...
If Test <- 'B' the result should be:
> iris
Sepal.Length Sepal.Width Petal.Length Petal.Width Species Test_B
1 5.1 3.5 1.4 0.2 setosa 1.1
2 4.9 3.0 1.4 0.2 setosa 33
3 4.7 3.2 1.3 0.2 setosa 343
4 4.6 3.1 1.5 0.2 setosa 2.22
5 5.0 3.6 1.4 0.2 setosa 2.4
...
The variable "Test" is defined somewhere in the user's cockpit and affects multiple nested scripts (so no hard coding, please).

dplyr::rename_at should work. Create a column with the test results and rename it with a function that depends on Test.
library(dplyr)

Test <- 'A'
Test_results <- c(1.1, 33, 343, 2.22, 2.4)
# the \(x) lambda shorthand needs R >= 4.1
iris %>%
  head(n = 5) %>%
  mutate(Test_results = Test_results) %>%
  rename_at('Test_results', \(x) case_when(Test %in% c('A', 'B') ~ paste0('Test_', Test),
                                           TRUE ~ 'no_results'))

How to pivot data in Power BI and then show a line chart for the pivoted data

I have interest rate curve data for different dates and I want to compare them. In Excel I create a pivot table and then a chart from the pivot. How do I do the same in Power BI?
Data example:
Example of the data pivoted in Excel (note the filter here) and a chart comparing the series:
Example of PivotChart
I want to create this chart in Power BI.
Data in text format:
SeriesName    SeqId  Data       Value
EUROIS        1      31-Dec-21  1.1
EUROIS        2      31-Dec-21  1.2
EUROIS        3      31-Dec-21  1.3
EUROIS        4      31-Dec-21  1.4
EUROIS        5      31-Dec-21  1.5
EUREURIBOR3M  1      31-Dec-21  3.2
EUREURIBOR3M  2      31-Dec-21  3.3
EUREURIBOR3M  3      31-Dec-21  3.4
EUREURIBOR3M  4      31-Dec-21  3.5
EUREURIBOR3M  5      31-Dec-21  3.6
EUROIS        1      31-Jan-22  0.1
EUROIS        2      31-Jan-22  0.2
EUROIS        3      31-Jan-22  0.3
EUROIS        4      31-Jan-22  0.4
EUROIS        5      31-Jan-22  0.5
EUREURIBOR3M  1      31-Jan-22  2.2
EUREURIBOR3M  2      31-Jan-22  2.3
EUREURIBOR3M  3      31-Jan-22  2.4
EUREURIBOR3M  4      31-Jan-22  2.5
EUREURIBOR3M  5      31-Jan-22  2.6
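
To make the target shape concrete, the Excel-style pivot described above can be sketched in pandas. This only illustrates the reshaping (one row per SeqId, one column per series and date) and is not a Power BI solution. The variable names `curves`, `long_df` and `wide` are assumptions made for the example:

```python
import pandas as pd

# Sample values from the question, keyed by (SeriesName, Data).
curves = {
    ("EUROIS", "31-Dec-21"): [1.1, 1.2, 1.3, 1.4, 1.5],
    ("EUREURIBOR3M", "31-Dec-21"): [3.2, 3.3, 3.4, 3.5, 3.6],
    ("EUROIS", "31-Jan-22"): [0.1, 0.2, 0.3, 0.4, 0.5],
    ("EUREURIBOR3M", "31-Jan-22"): [2.2, 2.3, 2.4, 2.5, 2.6],
}

# Rebuild the long table: one row per series, date and SeqId.
rows = []
for (series, date), values in curves.items():
    for seq_id, value in enumerate(values, start=1):
        rows.append(
            {"SeriesName": series, "SeqId": seq_id, "Data": date, "Value": value}
        )
long_df = pd.DataFrame(rows)

# Pivot: SeqId becomes the row index, each (series, date) pair a column.
wide = long_df.pivot(index="SeqId", columns=["SeriesName", "Data"], values="Value")
```

Each column of `wide` is then one curve, which is the layout a line chart with SeqId on the x-axis and one line per series and date would need.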

Calculate the average of identical columns of several dataframes

I am trying to write a function that calculates the average of identical columns of different dataframes stored in a list:
import pandas as pd

def mean(dfs):
    # declare an empty dataframe
    df_mean = pd.DataFrame()
    # use the Time column as the index of each raw dataframe
    for i in range(len(dfs)):
        dfs[i].set_index(['Time'], inplace=True)
    for j in dfs[0].columns:
        for i in range(len(dfs)):
            df_mean[j] = pd.concat([df_mean,dfs[i][j]], axis=1).mean(axis=1)
    return df_mean

dfs = []
l1 = [[1,6,2,6,7],[2,3,2,6,8],[3,3,2,8,8],[4,5,2,6,8],[5,3,9,6,8]]
l2 = [[1,7,2,5,7],[2,3,0,6,8],[3,3,3,6,8],[4,3,7,6,8],[5,3,0,6,8]]
dfs.append(pd.DataFrame(l1, columns=['Time','25','50','75','100']))
dfs.append(pd.DataFrame(l2, columns=['Time','25','50','75','100']))
mean(dfs)
However, only the mean of the first column comes out right!
Option 1
Use Python's built-in sum, which reduces the list with the + operator (each DataFrame's __add__ method), starting from 0. Then just divide by the length of the list.
sum(dfs) / len(dfs)
Time 25 50 75 100
0 1.0 6.5 2.0 5.5 7.0
1 2.0 3.0 1.0 6.0 8.0
2 3.0 3.0 2.5 7.0 8.0
3 4.0 4.0 4.5 6.0 8.0
4 5.0 3.0 4.5 6.0 8.0
Option 2
Reconstruct while using numpy's mean function
pd.DataFrame(
    np.mean([d.values for d in dfs], 0),
    dfs[0].index, dfs[0].columns)
Time 25 50 75 100
0 1.0 6.5 2.0 5.5 7.0
1 2.0 3.0 1.0 6.0 8.0
2 3.0 3.0 2.5 7.0 8.0
3 4.0 4.0 4.5 6.0 8.0
4 5.0 3.0 4.5 6.0 8.0
Use concat on the Time-indexed list of dataframes, then group the combined dataframe by Time and take the mean:
In [275]: pd.concat([d.set_index('Time') for d in dfs]).groupby(level='Time').mean()
Out[275]:
25 50 75 100
Time
1 6.5 2.0 5.5 7.0
2 3.0 1.0 6.0 8.0
3 3.0 2.5 7.0 8.0
4 4.0 4.5 6.0 8.0
5 3.0 4.5 6.0 8.0
Or, since the Time column is common to both anyway, at least in this use case:
In [289]: pd.concat(dfs).groupby(level=0).mean()
Out[289]:
Time 25 50 75 100
0 1.0 6.5 2.0 5.5 7.0
1 2.0 3.0 1.0 6.0 8.0
2 3.0 3.0 2.5 7.0 8.0
3 4.0 4.0 4.5 6.0 8.0
4 5.0 3.0 4.5 6.0 8.0
Details
In [276]: dfs
Out[276]:
[ Time 25 50 75 100
0 1 6 2 6 7
1 2 3 2 6 8
2 3 3 2 8 8
3 4 5 2 6 8
4 5 3 9 6 8, Time 25 50 75 100
0 1 7 2 5 7
1 2 3 0 6 8
2 3 3 3 6 8
3 4 3 7 6 8
4 5 3 0 6 8]
In [277]: pd.concat([d.set_index('Time') for d in dfs])
Out[277]:
25 50 75 100
Time
1 6 2 6 7
2 3 2 6 8
3 3 2 8 8
4 5 2 6 8
5 3 9 6 8
1 7 2 5 7
2 3 0 6 8
3 3 3 6 8
4 3 7 6 8
5 3 0 6 8
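The concat and groupby approach above could be wrapped in a small function, which is what the question originally asked for. This is a minimal sketch, not the answerers' own code: the function name `mean_of_frames` and the `key` parameter are hypothetical, and it assumes every dataframe has a column named by `key`.

```python
import pandas as pd


def mean_of_frames(dfs, key="Time"):
    """Average identically named columns across dataframes, aligned on `key`."""
    # set_index without inplace=True returns new frames, so the inputs stay untouched.
    stacked = pd.concat([d.set_index(key) for d in dfs])
    # Rows sharing the same key value are averaged column by column.
    return stacked.groupby(level=key).mean()


# Sample data from the question.
l1 = [[1, 6, 2, 6, 7], [2, 3, 2, 6, 8], [3, 3, 2, 8, 8], [4, 5, 2, 6, 8], [5, 3, 9, 6, 8]]
l2 = [[1, 7, 2, 5, 7], [2, 3, 0, 6, 8], [3, 3, 3, 6, 8], [4, 3, 7, 6, 8], [5, 3, 0, 6, 8]]
cols = ["Time", "25", "50", "75", "100"]
dfs = [pd.DataFrame(l1, columns=cols), pd.DataFrame(l2, columns=cols)]

result = mean_of_frames(dfs)
```

Unlike the in-place set_index in the question, this leaves the original list usable for a second call.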

Why were the ranks different?

One:
data have;
input x1 x2;
diff=x1-x2;
a_diff= round(abs(diff), .01);
* a_diff=abs(diff);
cards;
50.7 60
3.3 3.3
28.8 30
46.2 43.2
1.2 2.2
25.5 27.5
2.9 4.9
5.4 5
3.8 3.2
1 4
;
run;
proc rank data =have out =have_r;
where diff;
var a_diff ;
ranks a_diff_r;
run;
proc print data =have_r;run;
Results:
Obs x1 x2 diff a_diff a_diff_r
1 50.7 60.0 -9.3 9.3 9.0
2 28.8 30.0 -1.2 1.2 4.0
3 46.2 43.2 3.0 3.0 7.5
4 1.2 2.2 -1.0 1.0 3.0
5 25.5 27.5 -2.0 2.0 5.5
6 2.9 4.9 -2.0 2.0 5.5
7 5.4 5.0 0.4 0.4 1.0
8 3.8 3.2 0.6 0.6 2.0
9 1.0 4.0 -3.0 3.0 7.5
Two:
data have;
input x1 x2;
diff=x1-x2;
a_diff=abs(diff);
cards;
50.7 60
3.3 3.3
28.8 30
46.2 43.2
1.2 2.2
25.5 27.5
2.9 4.9
5.4 5
3.8 3.2
1 4
;
run;
proc rank data =have out =have_r;
where diff;
var a_diff ;
ranks a_diff_r;
run;
proc print data =have_r;run;
results:
Obs x1 x2 diff a_diff a_diff_r
1 50.7 60.0 -9.3 9.3 9.0
2 28.8 30.0 -1.2 1.2 4.0
3 46.2 43.2 3.0 3.0 7.5
4 1.2 2.2 -1.0 1.0 3.0
5 25.5 27.5 -2.0 2.0 5.0
6 2.9 4.9 -2.0 2.0 6.0
7 5.4 5.0 0.4 0.4 1.0
8 3.8 3.2 0.6 0.6 2.0
9 1.0 4.0 -3.0 3.0 7.5
Please look at Obs 3, 9, 5 and 6. Why are the ranks different? Thank you!
Run the code below and you'll see that the values are actually different. That's because of inaccuracies in numeric storage. It is similar to how 1/3 is not exactly representable in decimal notation (0.333333333333333 etc.): 1-(1/3)-(1/3)-(1/3) is not equal to zero if you use, say, ten digits to store each result as you go (it is then equal to 0.0000000001). In the same way, any computer system will have issues with certain numbers that appear to store nicely in decimal (base 10) but do not in binary.
The solution here is basically to round, as you are doing, or to fuzz the result, which amounts to the same thing (it ignores differences less than 1x10^-12).
data have;
input x1 x2;
diff=x1-x2;
a_diff=abs(diff);
put a_diff= hex16.;
cards;
50.7 60
3.3 3.3
28.8 30
46.2 43.2
1.2 2.2
25.5 27.5
2.9 4.9
5.4 5
3.8 3.2
1 4
;
run;
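The same effect can be reproduced outside SAS. The sketch below uses Python purely as an illustration, assuming IEEE 754 double-precision numbers on both sides; `float.hex()` stands in for the hex16. format, and the variable names are made up for the example:

```python
# Absolute differences for the observations discussed in the question.
d5 = abs(25.5 - 27.5)  # Obs 5
d6 = abs(2.9 - 4.9)    # Obs 6
d3 = abs(46.2 - 43.2)  # Obs 3
d9 = abs(1 - 4)        # Obs 9

# Hex view of the stored doubles: Obs 5 and 6 differ in the last bit.
print(d5.hex())  # 0x1.0000000000000p+1
print(d6.hex())  # 0x1.0000000000001p+1

print(d5 == d6)                      # False: ranked separately without rounding
print(round(d5, 2) == round(d6, 2))  # True: tied once rounded
print(d3 == d9)                      # True: tied with or without rounding
```

This matches the two result tables: Obs 3 and 9 share rank 7.5 in both runs, while Obs 5 and 6 share rank 5.5 only when a_diff is rounded.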

Read by Row in SAS

How do I read a data file one row at a time in SAS?
Say, I have 3 lines of data
1.0 3.0 5.6 7.8
2.3 4.9
3.2 5.3 6.8 7.5 3.9 4.1
I have to read each line in a different variable. I want the data to look like.
A 1.0
A 3.0
A 5.6
A 7.8
B 2.3
B 4.9
C 3.2
C 5.3
C 6.8
C 7.5
C 3.9
C 4.1
I tried a bunch of things.
If there is a variable name before every data point, the following code works fine:
INPUT group $ x @@;
I can't figure out how to go about this. Can someone please guide me?
Thanks

I think this will produce almost exactly the result you want. You could apply a format to the Group variable.
data orig;
  infile datalines missover pad;
  format Group 4. Value 4.1;
  Group = _n_;  /* one group per input line */
  do until (Value eq .);
    input value @;  /* trailing @ holds the line for the next read */
    if value ne . then output;
    else return;
  end;
datalines;
1.0 3.0 5.6 7.8
2.3 4.9
3.2 5.3 6.8 7.5 3.9 4.1
;
run;
proc print; run;
/*
Obs Group Value
1 1 1.0
2 1 3.0
3 1 5.6
4 1 7.8
5 2 2.3
6 2 4.9
7 3 3.2
8 3 5.3
9 3 6.8
10 3 7.5
11 3 3.9
12 3 4.1 */
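
For comparison, the reshaping logic (one group label per input line, one output record per value) can be sketched in Python. This is not a SAS solution; the names `raw` and `records` are illustrative, and the letter labels assume at most 26 input lines:

```python
import io
import string

# The three sample lines from the question, as an in-memory file.
raw = io.StringIO("1.0 3.0 5.6 7.8\n2.3 4.9\n3.2 5.3 6.8 7.5 3.9 4.1\n")

records = []
# Pair each line with a letter label: A for line 1, B for line 2, and so on.
for label, line in zip(string.ascii_uppercase, raw):
    # Every whitespace-separated token on the line becomes its own record.
    for token in line.split():
        records.append((label, float(token)))

for label, value in records:
    print(label, value)
```

The loop mirrors the SAS step: the outer iteration corresponds to one pass of the data step per line, and the inner iteration to the repeated `input value @;`.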