I have the following data and I would like to apply the log() function:
v1
2
3
4
-1
5
Expected output:
v1
2 0.69 ~ log(2)
3 1.10 ~ log(3)
4 1.39 ~ log(4)
-1 .
5 1.61 ~ log(5)
This is just a simplified version of the problem. There are 35,000 observations in my dataset, and I could not find any simple rule like drop if v1 <= 0 to solve this problem.
Without screening my data first, one method that comes to mind is to use a for loop and run the log() function over the observations. However, I couldn't find any resource explaining how to do that.
Stata will return missing if asked to take the logarithm of zero or negative values. But
generate log_x = log(x)
and
generate log_x = log(x) if x > 0
will have precisely the same result: missing values in the observations with problematic values.
The bigger question here is statistical. Why do you want to take logarithms of such a variable anyway? If your idea is to transform a variable, then other transformations are available. If the variable is a response or outcome variable, then a generalized linear model with logarithmic link will work even if there are some zero or negative values; the idea is just that the mean function should remain positive.
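For example, in Stata (a minimal sketch; y, x1, and x2 are placeholder names, not from the question):
glm y x1 x2, family(gaussian) link(log)    // models log E(y), so y itself may contain zeros or negatives
gen asinh_y = asinh(y)    // one alternative transformation, defined for all real values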
There have been many, many threads raising these issues on Cross Validated and Statalist.
I can't imagine why you think a loop is either needed or helpful here. With generate statements of the kind above, Stata automatically loops over observations.
I have an array of numbers, and for every number I want to check whether it's greater than a value in another cell; if it is, I want to add the difference to the total sum.
I have managed to do this "manually" for a number of cells, but there must be a better way.
For simplicity I just compare the values to 10, but the threshold will really come from another cell.
=sum(if(A1>=10,A1-10,0),if(A2>=10,A2-10,0),if(A3>=10,A3-10,0))
The formula above yields the expected result for A1:A3.
What unfortunately doesn't work is:
=SUM(if(A1:A3>=10,A1:A3-10,0))
In the end I changed my approach to arrive at a solution:
=SUMIF(A1:A3,">10") - COUNTIF(A1:A3,">10") * 10
So instead of summing the differences directly, we sum the qualifying values and then subtract the threshold once for each value we summed.
Try with this:
=SUM(ARRAYFORMULA(IF(A:A="","",IF(A:A>=10,A:A-10,0))))
try:
=BYROW(A1:A20, LAMBDA(a, IF(a>=10, a-10, 0)))
or:
=SUMPRODUCT(BYROW(A1:A20, LAMBDA(a, IF(a>=10, a-10, 0))))
I am trying to identify values that are not integers in Stata. My dataset is the following:
var1 var2 var3
1 2 3
2 4 5
3 6 7
4 2 3
5 1 1
6 2 8
My code is the following:
foreach var in var1 var2 var3 {
    gen flag_`var' = 1 if format(`var') == %int
    replace flag_`var' = 0 if flag_`var' == .
}
I am getting an error message stating
unknown function format()
I also tried replacing the parentheses around format(`var') with square brackets, as in format[`var'], but then I got an error stating format not found. Is there something wrong with the format I am using, or is there a better way to identify non-integer values?
The first answer is what Stata told you: there is no format() function.
But a deeper answer is that thinking of (display) formats is the wrong way round for this question. A display format is in essence an instruction to show data in a certain way and has nothing to do with its stored value, or to be more precise the decimal equivalent of its stored value. Thus 42 displayed with format %4.3f is shown as 42.000 while 6.789 displayed with format %1.0f is shown as 7. Otherwise put, no value has an inherent format, but a display format is used to display a value, either by default or because a user specified a format. Stata is here just using the same broad ideas as say C and various C-like languages.
"Nothing to do with its stored value" is a slight exaggeration, as only numeric formats make sense for numbers and only string formats make sense for strings, but a display format has nothing to do with whether a stored value is an integer.
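To see the display-format point concretely at the Stata prompt (just echoing the values above):
display %4.3f 42       // shows 42.000
display %1.0f 6.789    // shows 7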
Further, %int is not a display format anyway. When formats are being checked for, they should be literal strings enclosed in "", such as "%4.3f".
To show non-integers, various methods could be used, say rounding functions such as round(), int(), floor(), or ceil(). So an indicator for whether x is an integer could be
gen is_int_x = x == floor(x)
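Applied to the variables in the question, the loop could be rewritten like this (a sketch; it flags non-integers directly, and missing values get flag 0):
foreach var in var1 var2 var3 {
    gen flag_`var' = `var' != floor(`var')    // 1 if non-integer, 0 otherwise
}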
All the values in your data example are integers anyway, but I take it that you are looking for non-integers elsewhere.
I am running a quasi-experiment and am interested in estimating the ATT. I have data with 260k entries where Ti = 0 and 5k entries where Ti = 1. I am calculating the ATT with the IPTW technique; I achieve good balance, and the estimated treatment effect on the treated is -450 euros but not significant.
Weight calculation:
if treatment = 1, weight = 1; otherwise, weight = propensity score / (1 - propensity score)
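In standard notation, with e(x_i) the estimated propensity score, these are the usual ATT weights:
$$ w_i = T_i + (1 - T_i)\,\frac{e(x_i)}{1 - e(x_i)} $$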
Then, to compare against another methodology, I use nearest-neighbour matching with ratio = 1; balance is again achieved. I get the treatment effect (which is the ATT by default in matching) as +750 and significant.
Shouldn't both methods generate similar results? Which method should I go for in this case, and why?
When you match, are there any treated individuals without a match?
In expectation, IPTW and matching should both give the same answer. One possible explanation is that some treated individuals don't have a close match, so they are dropped. When this happens, the population for which the causal effect is defined changes. This could result in different answers between the methods.
Each of these methods needs to be evaluated differently.
For IPW you need to check that you did not get samples with extremely low (or extremely high) propensity scores. If some are close to 0 or 1, you need to evaluate why this happened and probably remove such samples from the data. Since your treatment groups are very unbalanced, this could certainly happen.
For matching, as @pzivich said, you need to check whether there are samples that did not get matched (similar to a very low propensity).
Finally, I like checking the balancing on held-out data to check that there is no over-fitting.
Let's say you have a survey dataset with 12 variables that stem from the same question, where each variable reports one response option for that question (multiple responses are possible). Each variable (i.e. response option) is numeric with yes/no values. I am trying to combine all of these variables into one, so that I can do cross-tabs with other variables such as village name, draw out the frequencies of each individual response, and produce graphs without extensive formatting. Does anyone have a solution: either to combine the variables or to do a multivariable cross-tab that doesn't require a lot of time spent on formatting?
Example data:
A B C D E F
1 0 1 0 1 0
0 0 1 0 1 1
1 1 1 0 0 0
There are many tricks and techniques here.
Tricks include using egen's concat() function as well as the group() function mentioned by @Dimitriy V. Masterov.
Techniques include special tabulation or listing commands, including tabm and groups on SSC and mrtab at the Stata Journal; on the last, see this article.
See also this article in the Stata Journal for a general discussion of handling multiple responses.
Does egen pattern = group(A-F), label do what you desire? If not, perhaps you can clarify what the desired transformation would look like for the 3 respondents you have shown.
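For concreteness, a minimal sketch of that suggestion (village is a placeholder for whatever variable you want to cross-tabulate against):
egen pattern = group(A-F), label    // one category per distinct yes/no response pattern
tabulate pattern village            // cross-tab of response patterns against another variable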
I have a 2D matrix of size [3][x] filled with numbers. I want to pick x numbers from this matrix based on these conditions:
exactly one number from each column;
at most m numbers from each row (the total across all 3 rows should be x numbers, and 3m > x).
I want to find the least possible sum of these selected x numbers.
I was able to pick the numbers with a greedy, iterative approach, repeatedly taking the smallest available number subject to the conditions above. But my answer is not optimal.
E.g.:
5 9 . . . .
6 15 . . . .
7 19 . . . .
Let's say 5 is picked first (so 6 and 7 cannot be picked for that column). Later we try to pick 9, but if the m elements of row 0 are used up we have to pick 15 instead. Our solution is then 5 + 15 = 20, whereas 6 + 9 = 15 would have been optimal.
I am trying to optimize my solution and am looking for better algorithms. Can someone suggest an approach that guarantees an optimal solution?
The problem reminds me of this one: http://projecteuler.net/problem=345
The Hungarian algorithm might work: http://en.wikipedia.org/wiki/Hungarian_algorithm
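For what it's worth, the selection can be written as a small integer program (a sketch: c_ij is the entry in row i, column j, and z_ij = 1 means that entry is picked). In this form it is a transportation problem, so min-cost-flow solvers apply as well as Hungarian-style algorithms:
$$ \min \sum_{i=1}^{3} \sum_{j=1}^{x} c_{ij} z_{ij} \quad \text{s.t.} \quad \sum_{i=1}^{3} z_{ij} = 1 \;\; (j = 1,\dots,x), \qquad \sum_{j=1}^{x} z_{ij} \le m \;\; (i = 1,2,3), \qquad z_{ij} \in \{0,1\} $$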