Apply a function to every value in a tibble (and return a tibble)?

This is straightforward usage, but most documentation about apply/plyr/dplyr explains more complex operations.
I want to create a new tibble from this_tbl
> this_tbl
# A tibble: 3 x 2
x y
<dbl> <dbl>
1 42 999
2 0 0
3 1 0
Such that every value > 0 is turned into a 1, and every value <= 0 becomes a 0.
> as_tibble(apply(this_tbl,2,function(x){ifelse(x>0, 1, 0)}))
# A tibble: 3 x 2
x y
<dbl> <dbl>
1 1 1
2 0 0
3 1 0
That works just fine, but is there a more elegant way to do this?

dplyr::mutate_all from @Jack Brookes's solution is now superseded by across(), even though this simple example does not need the additional functionality:
this_tbl %>% mutate(across(everything(), function(x) ifelse(x > 0, 1, 0)))

dplyr::mutate_all applies a function to all columns in a dataframe and returns the result.
this_tbl %>%
mutate_all(function(x){ifelse(x>0, 1, 0)})
Technically, this doesn't apply the function to "every value" but to each column as a whole, which is much faster. If there is a case where you want to do it value-by-value, you could make a vectorised version of the function.
greater_than_zero <- Vectorize(function(x){
ifelse(x > 0, 1, 0)
})
this_tbl %>%
mutate_all(greater_than_zero)


Optimize with indexing in linear programming

I have encountered several optimization problems that involve identifying one or more indices in a vector that maximizes or minimizes a cost. Is there a way to identify such indices in linear programming? I'm open to solutions in mathprog, CVXR, CVXPY, or any other API.
For example, identifying an index is needed for change point problems (find the index at which the function changes), putting distance constraints on the traveling salesman problem (visit city X before cumulative distance Y).
As a simple example, suppose we want to identify the location in a vector where the sum on either side is the most equal (their difference is smallest). In this example, the solution is index 5:
x = c(1, 3, 6, 4, 7, 9, 6, 2, 3)
Attempt 1
Using CVXR, I tried declaring split_index and using that as an index (e.g., x[1:split]):
library(CVXR)
split_index = Variable(1, integer = TRUE)
objective = Minimize(abs(sum(x[1:split_index]) - sum(x[(split_index+1):length(x)])))
result = solve(objective)
It fails at 1:split_index with the error "NA/NaN argument".
Attempt 2
Declare an explicit index-vector (indices) and do an elementwise logical test whether split_index <= indices. Then element-wise-multiply that binary vector with x to select one or the other side of the split:
indices = seq_along(x)
split_index = Variable(1, integer = TRUE)
is_first = split_index <= indices
objective = Minimize(abs(sum(x * is_first) - sum(x * !is_first)))
result = solve(objective)
It fails at x * is_first with the error "non-numeric argument to binary operator". I suspect that this error arises because is_first is now an IneqConstraint object.
[Figure not reproduced: mathematical model of the problem. Symbols in red are decision variables and symbols in blue are constants.]
R code:
> library(Rglpk)
> library(CVXR)
>
> x <- c(1, 3, 6, 4, 7, 9, 6, 2, 3)
> n <- length(x)
> delta <- Variable(n, boolean=T)
> y <- Variable(2)
> order <- list()
> for (i in 2:n) {
+ order[[as.character(i)]] <- delta[i-1] <= delta[i]
+ }
>
>
> problem <- Problem(Minimize(abs(y[1]-y[2])),
+ c(order,
+ y[1] == t(1-delta) %*% x,
+ y[2] == t(delta) %*%x))
> result <- solve(problem,solver = "GLPK", verbose=T)
GLPK Simplex Optimizer, v4.47
30 rows, 12 columns, 60 non-zeros
0: obj = 0.000000000e+000 infeas = 4.100e+001 (2)
* 7: obj = 0.000000000e+000 infeas = 0.000e+000 (0)
* 8: obj = 0.000000000e+000 infeas = 0.000e+000 (0)
OPTIMAL SOLUTION FOUND
GLPK Integer Optimizer, v4.47
30 rows, 12 columns, 60 non-zeros
9 integer variables, none of which are binary
Integer optimization begins...
+ 8: mip = not found yet >= -inf (1; 0)
+ 9: >>>>> 1.000000000e+000 >= 0.000000000e+000 100.0% (2; 0)
+ 9: mip = 1.000000000e+000 >= tree is empty 0.0% (0; 3)
INTEGER OPTIMAL SOLUTION FOUND
> result$getValue(delta)
[,1]
[1,] 0
[2,] 0
[3,] 0
[4,] 0
[5,] 0
[6,] 1
[7,] 1
[8,] 1
[9,] 1
> result$getValue(y)
[,1]
[1,] 21
[2,] 20
>
The absolute value is automatically linearized by CVXR.
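To see why the ordering constraints delta[i-1] <= delta[i] encode a split index, note that they leave only vectors of the form 0...0 1...1 feasible, one for each possible split position. The following plain-Python sketch (no solver involved; the function names are made up for illustration) enumerates those patterns for the same x and picks the one with the smallest |y1 - y2|. It is a brute-force check of the formulation, not a replacement for the CVXR model.

```python
x = [1, 3, 6, 4, 7, 9, 6, 2, 3]

def monotone_patterns(n):
    """All 0/1 vectors of length n with delta[i-1] <= delta[i]."""
    # k is the number of leading zeros, i.e. the split index
    return [[0] * k + [1] * (n - k) for k in range(n + 1)]

def evaluate(x, delta):
    """Return (|y1 - y2|, y1, y2) for one selection vector."""
    y1 = sum(xi for xi, d in zip(x, delta) if d == 0)  # left side of the split
    y2 = sum(xi for xi, d in zip(x, delta) if d == 1)  # right side of the split
    return abs(y1 - y2), y1, y2

# Pick the feasible pattern with the smallest absolute difference
best_delta = min(monotone_patterns(len(x)), key=lambda d: evaluate(x, d)[0])
best_gap, y1, y2 = evaluate(x, best_delta)
split_index = best_delta.count(0)

print(best_delta)                     # [0, 0, 0, 0, 0, 1, 1, 1, 1]
print(split_index, y1, y2, best_gap)  # 5 21 20 1
```

With only n + 1 feasible patterns, enumeration is enough for this toy problem; the binary-variable formulation matters once the split interacts with other constraints.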
At the end of the day, if you are selecting things by index, I think you need to work this with a set of corresponding binary selection variables. The fact that you are selecting "things in a row" as in your example problem is just something that needs to be handled with constraints on the binary variables.
To solve the problem you posed, I made a set of binary selection variables, call it s[i] where i = {0, 1, 2, ..., len(x)} and then constrained:
s[i] <= s[i-1] for i = {1, 2, ..., len(x)}
which enforces the "continuity" from the start up to the first non-selection and then thereafter.
My solution is in Python. Let me know if you'd like me to post it. The concept above, I think, is what you are asking about.

if else loop not working

I am a beginner in RStudio, so hopefully someone can help me with this problem. The case: I want to make an if/else loop. I wrote the following code for an l-by-m matrix:
for (i in 1:l){
for (j in 1:m){
if (is.na(quantilereturns[i,j]) < quantile(quantilereturns[,j], c(.1), na.rm=TRUE)) {
quantilereturns[i,j]
} else { (0) }
}
}
Summary: I want to make a matrix with values that are smaller than the quantile of a certain vector in the matrix quantilereturns. So when they are smaller than the 10% quantile they get their original value otherwise it will be a zero.
The code doesn't give any errors, but it doesn't change the values in the matrix either.
Can someone help me?
You need to assign the result to a cell of the matrix. I will take the matrix of a recent other thread as an example:
a <- c(4, -9, 2)
b <- c(-1, 3, -8)
c <- c(5, 2, 6)
d <- c(7, 9, -2)
matrix <- cbind(a,b,c,d)
d <- dim(matrix)
rows <- d[1]
columns <- d[2]
print("Before")
print(matrix)
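for (j in 1:columns) {
  # compute the 10% quantile once per column, before any value is overwritten
  threshold <- quantile(matrix[,j], c(.1), na.rm=TRUE)
  for (i in 1:rows) {
    # compare the value itself (not is.na() of it) with the threshold
    if (!is.na(matrix[i,j]) && matrix[i,j] >= threshold) {
      matrix[i,j] <- 0
    }
  }
}
print("After")
print(matrix)
this gives
[1] "Before"
      a  b c  d
[1,]  4 -1 5  7
[2,] -9  3 2  9
[3,]  2 -8 6 -2
[1] "After"
      a  b c  d
[1,]  0  0 0  0
[2,] -9  0 2  0
[3,]  0 -8 0 -2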
So the essential line you are looking for is matrix[i,j] <- 0
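For comparison, the same idea can be sketched without looping over individual cells. This is an illustrative Python version (the thread itself is about R) using only the standard library. It relies on statistics.quantiles with method="inclusive", which for this data gives the same interpolated 10% cut point as R's default quantile; treat that match as an assumption to verify for other inputs. The threshold is computed once per column, and every value at or above it is replaced by 0. The helper name is hypothetical, and missing values are not handled here.

```python
from statistics import quantiles

# Columns a, b, c, d from the example matrix
columns = {
    "a": [4, -9, 2],
    "b": [-1, 3, -8],
    "c": [5, 2, 6],
    "d": [7, 9, -2],
}

def keep_below_quantile(values, n=10):
    """Keep values below the first n-quantile cut point (10% for n=10), zero the rest."""
    threshold = quantiles(values, n=n, method="inclusive")[0]
    return [v if v < threshold else 0 for v in values]

# Build a new structure instead of modifying the input in place
result = {name: keep_below_quantile(col) for name, col in columns.items()}
print(result)
```

The key point is the same as in the loop version: the result has to be stored somewhere, here by building a new dictionary rather than assigning cell by cell.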

Need help writing estimates statements in proc genmod

I'm using proc genmod to predict an outcome measured at 4 time points. The outcome is a total score on a mood inventory, which can range from 0 to 82. A lot of participants have a score of 0, so the negative binomial distribution in proc genmod seemed like a good fit for the data.
Now, I'm struggling with how to write/interpret the estimates statements. The primary predictors are TBI status at baseline (0=no/1=yes), and visit (0=baseline, 1=second visit, 2=third visit, 4=fourth visit), and an interaction of TBI status and visit.
How do I write my estimates, such that I'm getting out:
1. the average difference in mood inventory score for person with TBI versus a person without, at baseline.
and
2. the average difference in mood inventory change score for a person with TBI versus a person without, over the 4 study visits?
Below is what I have thus far, but I'm not sure how to interpret the output, also below, if indeed my code is correct:
proc genmod data = analyze_long_3 ;
class id screen_tbi (param = ref ref = first) ;
model nsi_total = visit_cent screen_tbi screen_tbi*visit_cent /dist=negbin ;
output predicted = predstats;
repeated subject=id /type=cs;
estimate "tbi" intercept 1 visit_cent 0 0 0 0 screen_tbi 1 0 /exp;
estimate "no tbi" intercept 1 visit_cent 0 0 0 0 screen_tbi 0 1 /exp;
estimate 'longitudinal TBI' intercept 1
visit_cent -1 1 1 1
screen_tbi 1 0
screen_tbi*visit_cent 1 0 0 0
0 1 0 0
0 0 1 0
0 0 0 1
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0 / exp;
estimate 'longitudinal no TBI ' intercept 1
visit_cent -1 1 1 1
screen_tbi 0 1
screen_tbi*visit_cent 0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
1 0 0 0
0 1 0 0
0 0 1 0
0 0 0 1 / exp;
where sample = 1 ;
run;
The first research question asks for the average difference in score, at baseline, between a person with TBI and a person without. It can be answered in the following steps:
1) Get the estimated average log (score) when TBI = yes, and Visit = baseline;
2) Get the estimated average log (score) when TBI = no, and Visit =baseline;
3) 1) – 2) to have the difference in log(score) values
4) Exp[3)] to have the difference as percentage of change in scores
To simplify, let T = TBI levels and V = visit levels. One thing to clarify: your post has 4 visit points with the first as the reference, so there should be 3 parameters for V, not four.
Taking the example of step 1), let’s try to write the ESTIMATE statement. It is a bit tricky. At first it sounds like this (T=0 and V =0 as reference):
ESTIMATE ‘Overall average’ intercept T 1 V 0 0 0;
But it is wrong. In the above statement, all arguments for V are set to 0. When all arguments are 0, it is the same as taking out V from the statement:
ESTIMATE ‘Overall average’ intercept T 1;
This is not the estimate of average for T=1 at baseline level. Rather, it produces an average for T=1, regardless of visit points, or, an average for all visit levels.
The problem is that the reference is set as V=0. In that case, SAS cannot tell the difference between estimates for the reference level and estimates for all levels; it always estimates the average for all levels. To solve this, the reference has to be coded as -1, i.e., T=-1 and V=-1 as reference, so that the statement looks like:
ESTIMATE ‘Average of T=1 V=baseline’ intercept T 1 V -1 -1 -1;
Now SAS understands that the job is to get the average at the baseline level, not across all levels.
To make the reference value as -1 instead of 0, in the CLASS statement, the option should be specified as PARAM = EFFECT, not PARAM = REF. That brings another problem: once PARAM is not set as REF, SAS will ignore the user defined references. For example:
CLASS id T (ref=’…’) V (ref=’…’) / PARAM=EFFECT;
The (ref=’…’) is ignored when PARAM=EFFECT. How to let SAS make TBI=No and Visit=baseline as references? Well, SAS automatically takes the last level as the reference. For example, if the variable T is ordered ascendingly, the value -1 comes as the first level, while the value 1 comes as the last level; therefore 1 will be the reference. Conversely, if T is ordered in descending order, the value -1 comes at the end and will be used as the ref. This is achieved by the option ‘DESCENDING’ in the CLASS statement.
CLASS id T V / PARAM=EFFECT DESCENDING;
That way, the parameters are ordered as:
T 1 (TBI =1)
T -1 (ref level of TBI, i.e., TBI=no)
V 1 0 0 (for visit =4)
V 0 1 0 (visit = 3)
V 0 0 1 (visit =2)
V -1 -1 -1 (this is the ref level, visit=baseline)
The above information is reported in the ODS table ‘Class Level Information’. It is always good to check the very table each time after running PROC GENMOD. Note that the level (visit = 4) comes before the level (visit =3), visit =3 coming before visit=2.
Now, let’s talk a bit about the parameters and the model equation. As you might know, in SAS, the V for multi-levels is indeed broken down into dummy Vs. If baseline is set as ref level, the dummies will be like:
V4 = the fourth visit or baseline
V3= the third visit, or baseline
V2 = the second visit or baseline
Accordingly, the equation can be written as:
LOG(s) = b0 + b1*T + b2*V4 + b3*V3 + b4*V2
where:
s = the total score on a mood inventory
T = 1 for TBI status of yes, = -1 for TBI status of no
V4 = 1 for the fourth visit, = -1 for baseline
V3 = 1 for the third visit, =-1 for baseline
V2 = 1 for the second visit, = -1 for the baseline
b0 to b4 are beta estimates for the parameters
Of note, the order in the model is the same as the order defined in the statement CLASS, and the same as the order in the ODS table ‘Class Level Information’. The V4, V3, V2 have to appear in the model, all or none, i.e., if the VISIT term is to be included, V4 V3 V2 should be all introduced into the model equation. If the VISIT term is not included, none of V4, V3, and V2 should be in the equation.
With interaction terms, 3 more dummy terms must be created:
T_V4 = T*V4
T_V3 = T*V3
T_V2 = T*V2
Hence the equation with interaction terms:
Log(s) = b0 + b1*T + b2*V4 + b3*V3 + b4*V2 + b5*T_V4 + b6* T_V3 + b7* T_V2
The SAS ‘ESTIMATE’ statement corresponds to the model equation.
For example, to estimate an overall average for all parameters and all levels, the equation is:
[Log(S)] = b0 ;
where [LOG(S)] stands for the expected LOG(score). Accordingly, the statement is:
ESTIMATE ‘overall (all levels of T and V)’ INTERCEPT;
In the above statement, ‘INTERCEPT’ corresponds to ‘b0’ in the equation.
To estimate an average of log (score) for T =1, and for all levels of visit points, the equation is
[LOG(S)] = b0 + b1 * T = b0 + b1 * 1
And the statement is
ESTIMATE ‘T=Yes, V= all levels’ INTERCEPT T 1;
In the above case, ‘T 1’ in the statement corresponds to the part “*1” in the equation (i.e., let T=1).
To estimate an average of log (score) for T =1, and for visit = baseline, the equation is:
[Log(s)] = b0 + b1*T + b2*V4 + b3*V3 + b4*V2
= b0 + b1*(1) + b2*(-1)+ b3*(-1) + b4*(-1)
The statement is:
ESTIMATE ‘T=Yes, V=Baseline’ INTERCEPT T 1 V -1 -1 -1;
‘V -1 -1 -1’ in the statement corresponds to the values of V4, V3, and V2 in the equation. As mentioned above, the dummies V4, V3 and V2 must all be introduced into the model. That is why the V term always takes three numbers, such as ‘V -1 -1 -1’ or ‘V 1 1 1’. SAS will give a warning in the log if you write ‘V -1 -1 -1 -1’, because there are four '-1's, one more than required; the extra '-1' is ignored. By contrast, ‘V 1 1’ is fine: it is the same as ‘V 1 1 0’. But what does 'V 1 1 0' mean? To figure that out, you have to read Allison’s book (see reference).
For now, let’s carry on, and add the interaction terms. The equation:
[Log(s)] = b0 + b1*T + b2*V4 + b3*V3 + b4*V2 + b5*T_V4 + b6*T_V3 + b7*T_V2
As T_V4 = T*V4 = 1 * (-1) = -1, similarly T_V3 = -1, T_V2=-1, substitute into the equation:
[Log(s)] = b0 + b1*1 + b2*(-1)+ b3*(-1)+ b4*(-1)+ b5*(-1) + b6*(-1) + b7*(-1)
The statement is:
ESTIMATE ‘(1) T=Yes, V=Baseline, with interaction’ INTERCEPT T 1 V -1 -1 -1 T*V -1 -1 -1;
The values in ‘T*V -1 -1 -1’ correspond to the values of T_V4, T_V3 and T_V2 in the equation.
And that is the statement for step 1)!
Step 2) follows the same reasoning, to get the estimated average log(score) when TBI = no and Visit = baseline:
T = -1, V4=-1, V3=-1, V2=-1.
T_V4 = T * V4 = (-1) * (-1) = 1
T_V3 = T * V3 = (-1) * (-1) = 1
T_V2 = T * V2 = (-1) * (-1) = 1
Substituting the values in the equation:
[Log(s)] = b0 + b1*(-1) + b2*(-1)+ b3*(-1)+ b4*(-1)+ b5*(1) + b6*(1) + b7*(1)
Note the numbers: for T: -1; for V: -1 -1 -1; for the interaction terms: 1 1 1
And the SAS statement:
ESTIMATE ‘(2) T=No, V=Baseline, with interaction’ INTERCEPT T -1 V -1 -1 -1 T*V 1 1 1;
The estimate results can be found in the ODS table ‘Contrast Estimate Results’.
For step 3), subtract estimate (2) from estimate (1) to get the difference in log(score); for step 4), exponentiate the difference from step 3).
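The subtraction in step 3) can also be written as a single row of coefficients. The sketch below, in Python purely for illustration, builds the coefficients on (b0, b1, ..., b7) implied by the effect coding described above (T = 1 or -1; V4, V3, V2 each 1 for their own visit, -1 for baseline, 0 otherwise) and takes the difference between the TBI and no-TBI rows at baseline. The function name and the beta values are hypothetical, chosen only to make the arithmetic visible; they are not estimates from any model.

```python
import math

def design_row(tbi, visit):
    """Coefficients on [b0, b1 (T), b2 (V4), b3 (V3), b4 (V2), b5 (T*V4), b6 (T*V3), b7 (T*V2)].

    tbi:   True for TBI = yes, False for TBI = no
    visit: 1 (baseline), 2, 3 or 4
    """
    t = 1 if tbi else -1
    # Effect coding: the baseline (reference) level is -1 on every visit dummy
    v = [-1, -1, -1] if visit == 1 else [int(visit == k) for k in (4, 3, 2)]
    return [1, t] + v + [t * vk for vk in v]

row_tbi = design_row(True, 1)
row_no_tbi = design_row(False, 1)
difference = [a - b for a, b in zip(row_tbi, row_no_tbi)]

print(row_tbi)      # [1, 1, -1, -1, -1, -1, -1, -1]
print(row_no_tbi)   # [1, -1, -1, -1, -1, 1, 1, 1]
print(difference)   # [0, 2, 0, 0, 0, -2, -2, -2]

# Hypothetical beta values, only to illustrate steps 3) and 4)
beta = [2.0, 0.3, 0.1, 0.05, 0.02, -0.04, -0.03, -0.01]
log_diff = sum(c * b for c, b in zip(difference, beta))
ratio = math.exp(log_diff)  # exp of a difference in logs is a ratio of expected scores
print(round(log_diff, 2), round(ratio, 3))
```

Under this coding the intercept and V coefficients cancel in the difference, leaving only the T and T*V terms. Note that exponentiating a difference of logs gives a ratio of expected scores rather than an absolute difference.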
For the second research question:
The average difference in mood inventory change score for a person with TBI versus a person without, over the 4 study visits.
"Over the 4 study visits" means across all visit levels. As you may have guessed by now, the statements are simpler:
ESTIMATE ‘(1) T=Yes, V=all levels’ INTERCEPT T 1;
ESTIMATE ‘(2) T=No, V=all levels’ INTERCEPT T -1;
Why are there no interaction terms? Because all visit levels are considered. And when all levels are considered, you do not have to put any visit-related terms into the statement.
Finally, the above approach requires some manual calculation. Indeed it is possible to make one single line of ESTIMATE statement that is equivalent to the aforementioned approach. However, the method we discussed above is way easier to understand. For more sophisticated methods, please read Allison’s book.
Reference:
1. Allison, Paul D. Logistic Regression Using SAS®: Theory and Application, Second Edition. Copyright © 2012, SAS Institute Inc.,Cary, North Carolina, USA.

python pandas dataframes add column depending on values other 2 col

I finally got to a message that I expected could solve my problem. I have two columns in a DataFrame (height, upper) with values either 1 or 0. Together they give 4 combinations, and I am trying to create a third column that encodes those 4 combinations, but I cannot figure out what is going wrong. My code is as follows:
def quad(clasif):
    if (raw['upper']==0 and raw['height']==0):
        return 1
    if (raw['upper']==1 and raw['height']==0):
        return 2
    if (raw['upper']==0 and raw['height']==1):
        return 3
    if (raw['upper']==1 and raw['height']==1):
        return 4
raw['cuatro']=raw.apply(lambda clasif: quad(clasif), axis=1)
I am getting the following error:
'The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().', u'occurred at index 0'
if someone could help?
Assuming that upper and height can only be 0 or 1, you can rewrite this as a simple addition:
raw['cuatro'] = 1 + raw['upper'] + 2 * raw['height']
The reason you see this error is that raw['upper'] == 0 is a Boolean Series, which you can't use with and... See the "gotcha" section of the docs.
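As a quick sanity check of the addition trick, the following plain-Python sketch (no pandas needed; it works on scalar values and rewrites quad to take two arguments purely for illustration) confirms that 1 + upper + 2 * height agrees with the four-branch function for every combination of 0 and 1.

```python
def quad(upper, height):
    """The four-branch classification from the question, applied to scalar values."""
    if upper == 0 and height == 0:
        return 1
    if upper == 1 and height == 0:
        return 2
    if upper == 0 and height == 1:
        return 3
    if upper == 1 and height == 1:
        return 4

for upper in (0, 1):
    for height in (0, 1):
        encoded = 1 + upper + 2 * height
        # The arithmetic encoding must match the branch-based result
        assert encoded == quad(upper, height)
        print(upper, height, encoded)
```

Because the arithmetic version only uses + and *, it operates on whole columns at once and never needs the truth value of a Series.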
I think you're missing the fundamentals of apply: when passed the Series clasif, your function should do something with clasif (at the moment, the function body makes no mention of it).
You have to pass the function to apply.
import pandas as pd

def quad(clasif):
    # clasif is one row of the DataFrame, so each lookup is a scalar
    if (clasif['upper']==0 and clasif['height']==0):
        return 1
    if (clasif['upper']==1 and clasif['height']==0):
        return 2
    if (clasif['upper']==0 and clasif['height']==1):
        return 3
    if (clasif['upper']==1 and clasif['height']==1):
        return 4

raw = pd.DataFrame({'upper': [0, 0, 1, 1], 'height': [0, 1, 0, 1]})
raw['cuatro']=raw.apply(quad, axis=1)
print(raw)
   upper  height  cuatro
0      0       0       1
1      0       1       3
2      1       0       2
3      1       1       4
Andy Hayden's answer is better suited for your case.

Formula that uses previous value

In Stata I want to have a variable calculated by a formula, which includes multiplying by the previous value, within blocks defined by a variable ID. I tried using a lag but that did not work for me.
In the formula below the Y-1 is intended to signify the value above (the lag).
gen Y = 0
replace Y = 1 if count == 1
sort ID
by ID: replace Y = (1+X)*Y-1 if count != 1
X Y count ID
. 1 1 1
2 3 2 1
1 6 3 1
3 24 4 1
2 72 5 1
. 1 1 2
1 2 2 2
7 16 3 2
Your code can be made a little more concise. Here's how:
input X count ID
. 1 1
2 2 1
1 3 1
3 4 1
2 5 1
. 1 2
1 2 2
7 3 2
end
gen Y = count == 1
bysort ID (count) : replace Y = (1 + X) * Y[_n-1] if count > 1
The creation of a dummy (indicator) variable can exploit the fact that true or false expressions are evaluated as 1 or 0.
Sorting before by and the subsequent by command can be condensed into one. Note that I spelled out that within blocks of ID, count should remain sorted.
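The recurrence itself is easy to check outside Stata. Here is a minimal Python sketch (standard library only; the function name is made up, and None stands in for Stata's missing value) that sorts by ID and count, restarts at 1 for each ID, and multiplies the previous Y by (1 + X) on every later row.

```python
from itertools import groupby

# (X, count, ID) rows from the example; None stands in for a missing X
rows = [
    (None, 1, 1), (2, 2, 1), (1, 3, 1), (3, 4, 1), (2, 5, 1),
    (None, 1, 2), (1, 2, 2), (7, 3, 2),
]

def running_product(rows):
    """Y = 1 on the first row of each ID, then (1 + X) * previous Y."""
    result = []
    ordered = sorted(rows, key=lambda r: (r[2], r[1]))  # like bysort ID (count)
    for _, block in groupby(ordered, key=lambda r: r[2]):
        y = None
        for x, count, _id in block:
            # The first row of a block never touches X, so a missing X is harmless
            y = 1 if count == 1 else (1 + x) * y
            result.append(y)
    return result

y_values = running_product(rows)
print(y_values)  # [1, 3, 6, 24, 72, 1, 2, 16]
```

The output matches the Y column in the question, which is a useful check that Y[_n-1] refers to the already-updated value from the previous row within each ID.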
This is really a comment, not another answer, but it would be less clear if presented as such.
Y-1, the lag in the formula, would be written as Y[_n-1], as shown below.
gen Y = 0
replace Y = 1 if count == 1
sort ID
by ID: replace Y = (1+X)*Y[_n-1] if count != 1