Transforming variables using ‘foreach’ - stata

I have a set of variables. I want to transform each variable in this set (‘y’) as follows: y' = (y - min(y)) / (max(y) - min(y)). That is, for each observation of each variable, I want to subtract the minimum value of that variable, and then divide the result by the difference between the maximum and minimum values of that variable.
I want to implement this via a loop, using foreach, but coding it as above (using the min() and max() functions) produces an error message. Are there any alternatives? Or must this just be done manually?

You should be able to adapt the example below. The command summarize stores the values you need in your formula in the returned r() values.
sysuse auto
local vars price mpg weight length
foreach var of local vars {
summarize `var'
replace `var' = (`var' - r(min)) / (r(max) - r(min))
}

There is no data example in the question, nor any indication of your variable names. This is an example you can run.
sysuse auto, clear
ds, has(type numeric)
foreach v in `r(varlist)' {
su `v', meanonly
gen `v'_scaled = (`v' - r(min)) / (r(max) - r(min))
}
su *scaled
    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
price_scaled |         74    .2278444    .2338086          0          1
  mpg_scaled |         74    .3205965    .1995001          0          1
rep78_scaled |         69    .6014493    .2474831          0          1
headroom_s~d |         74    .4266409    .2417128          0          1
trunk_scaled |         74    .4864865    .2376336          0          1
-------------+---------------------------------------------------------
weight_sca~d |         74    .4089154    .2523356          0          1
length_sca~d |         74     .504752    .2446851          0          1
 turn_scaled |         74    .4324324    .2199677          0          1
displaceme~d |         74    .3418997    .2654255          0          1
gear_ratio~d |         74    .4852146    .2684042          0          1
-------------+---------------------------------------------------------
foreign_sc~d |         74    .2972973    .4601885          0          1
The functions min() and max() in Stata require two or more arguments and operate rowwise in any case. They don't yield the minimum and maximum of a variable. You could use egen but the direct route of a loop and calling up summarize is preferable. Note that despite its name the meanonly option does produce the minimum and maximum.
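For completeness, here is a minimal sketch of the egen route mentioned above, run on two variables from the auto data. The helper variables (`v'_min and `v'_max) are names made up for this illustration. The display line shows that max() compares its own arguments rather than the values of a variable.

```stata
sysuse auto, clear

* max() and min() compare their arguments; this displays 5
display max(1, 5, 3)

* egen's min() and max() put the variable-wide extremes in every observation
foreach v of varlist price mpg {
    egen double `v'_min = min(`v')
    egen double `v'_max = max(`v')
    gen double `v'_scaled = (`v' - `v'_min) / (`v'_max - `v'_min)
    drop `v'_min `v'_max      // helper variables are no longer needed
}
summarize price_scaled mpg_scaled
```

This creates and drops two extra variables per loop pass, which is why the summarize route above is the more direct one.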

Generate sum of all possible combinations of id

I have a dataset with the structure that looks something like this:
Group ID Value
1 A 10
1 B 15
1 C 20
2 D 10
2 E 25
Within each Group, I want to obtain the sum of all possible combinations of two or more IDs. For instance, within group 1, I can have the following combinations: AB, AC, BC, ABC. So, in total I have four possible combinations for group 1, of which I'd like to get the sum of the variable value.
I am using the formula for combinations of N elements in groups of size R to identify how many observations I need to add to the dataset to have enough observations.
For Group 1, the number of observations I need are:
3!/((3-2)!*2!)*2 = 6 for the two-IDs combinations
3!/((3-3)!*3!)*3 = 3 for the three-IDs combination.
So a total of 9 observations. Since I already have three, I can use the command: expand 3 if Group==1 (expand 3 replaces each observation with three copies of itself). For Group 1 I would get something like
Group ID Value
1 A 10
1 B 15
1 C 20
1 A 10
1 B 15
1 C 20
1 A 10
1 B 15
1 C 20
Now, I am stuck here on how to proceed to tell Stata to identify the combinations and create the summation. Ideally, I want to create two new variables, to identify the tuples and get the summation, so something that looks like:
Group ID Value Tuple Sum
1 A 10 AB 25
1 B 15 AB 25
1 A 10 AC 30
1 C 20 AC 30
1 B 15 BC 35
1 C 20 BC 35
1 A 10 ABC 45
1 B 15 ABC 45
1 C 20 ABC 45
In this way, I could then just drop the duplicates in terms of Group and Tuples. Once I have the Tuples variable, getting the sum is straightforward, but getting the Tuples, I can't get my head around it.
Any advice on how to do this?
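I tried doing this with nested loops and the community-contributed tuples command (ssc install tuples). The code below also calls gegen, which comes from the community-contributed gtools package; the built-in egen with total() should serve equally well here.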
First I create and save a tempfile to store results:
clear
tempfile group_results
save `group_results', replace emptyok
Then I input and save data, along with a local for the number of groups:
clear
input Group str1 ID Value
1 A 10
1 B 15
1 C 20
2 D 10
2 E 25
2 F 13 // added to test
2 G 2 // added to test
end
sum Group
local num_groups = r(max)
tempfile base
save `base', replace
Here's the core of the code. The outer loop here iterates over Groups. Then it makes a list of the IDs in that group, and uses the tuples command to make a list of the unique combinations of those IDs, with a minimum size of 2. The k loop iterates through the number of tuples and the m loop makes an indicator for tuple membership.
forvalues i = 1/`num_groups' {
display "Starting Group `i'"
use `base' if Group==`i', clear
* Make list of IDs to get unique combos of
forvalues j = 1/`=_N' {
local tuple_list`i' = "`tuple_list`i'' " + ID[`j']
}
* Get all unique combos in list using tuples command
tuples `tuple_list`i'', display min(2)
forvalues k = 1/`ntuples' {
display "Tuple `k': `tuple`k''"
local length = wordcount("`tuple`k''")
gen intuple=0
gen tuple`k'="`tuple`k''"
forvalues m = 1/`length' {
replace intuple=1 if ID==word("`tuple`k''",`m')
}
* Calculate sum of values in that tuple
gegen group_sum`k' = sum(Value) if intuple==1
drop intuple
list
}
* Reshape into desired format
reshape long tuple group_sum, i(Group ID Value) j(tuple_num)
drop if missing(group_sum)
sort tuple_num
list
append using `group_results'
save `group_results', replace
}
* Full results
use `group_results', clear
sort Group tuple_num
list
I hope this helps. The list commands will give you a busy Results window, but they show everything that is happening. Here's the output at the end of the i loop for Group 1:
+--------------------------------------------------+
| Group ID Value tuple_~m tuple group_~m |
|--------------------------------------------------|
1. | 1 C 20 1 B C 35 |
2. | 1 B 15 1 B C 35 |
3. | 1 A 10 2 A C 30 |
4. | 1 C 20 2 A C 30 |
5. | 1 A 10 3 A B 25 |
|--------------------------------------------------|
6. | 1 B 15 3 A B 25 |
7. | 1 C 20 4 A B C 45 |
8. | 1 A 10 4 A B C 45 |
9. | 1 B 15 4 A B C 45 |
+--------------------------------------------------+
This could be inefficient if your data is actually much larger!
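To get a feel for how fast this grows, here is a rough sketch that counts the rows in the long result for a single group, using the same combinations formula as in the question: a group with n IDs yields r * C(n, r) rows for each tuple size r from 2 to n. The scalar name nrows is made up for this illustration.

```stata
* rows in the long result for one group with n IDs:
* the sum over r = 2, ..., n of r * C(n, r)
local n = 3
scalar nrows = 0
forvalues r = 2/`n' {
    scalar nrows = nrows + `r' * comb(`n', `r')
}
display "n = `n' IDs gives " nrows " rows"
```

With n = 3 this gives the 9 rows listed above. The total equals n * 2^(n - 1) - n, so it roughly doubles with each extra ID in a group.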

Hamming distance for grouped results

I'm working with a data set that contains 40 different participants, each with 30 observations.
As I am observing search behavior, I want to calculate the search distance for each subject per round (from 1 to 30).
In order to compare my data with current literature, I need to use the Hamming distance to describe search distances.
The variable is called Input and is a string variable of length 10 made up of binary inputs, 0 or 1. E.g.:
Input Type 1 Subject 1 Round 1: 0000011111
Input Type 1 Subject 1 Round 2: 0000011110
Using the Levenshtein distance, my approach was simple:
sort type_num Subject round_num
gen input_prev=Input[_n-1]
replace input_prev="0000000000" if round_num==1 //default starting position with 0000000000 to get search distance for first input in round 1
// Levenshtein distance & clearing data (Levenshtein instead of Hamming distance)
ustrdist Input input_prev
rename strdist input_change
I am now struggling with getting the right Stata commands for the Hamming distance. Can someone help?
Does this help? As I understand it, Hamming distance is the count of characters (bits) that differ at corresponding positions of strings of equal length. So, given two variables and wanting comparisons within each observation, it is just a loop over the characters.
clear
set obs 10
set seed 2803
* sandbox
quietly forval j = 1/2 {
gen test`j' = ""
forval k = 1/10 {
replace test`j' = test`j' + strofreal(runiform() > (`j' * 0.3))
}
}
set obs 12
replace test1 = 10 * "1" in 11
replace test2 = test1 in 11
replace test1 = test1[11] in 12
replace test2 = 10 * "0" in 12
* calculation
gen wanted = 0
quietly forval k = 1/10 {
replace wanted = wanted + (substr(test1, `k', 1) != substr(test2, `k', 1))
}
list
+----------------------------------+
| test1 test2 wanted |
|----------------------------------|
1. | 1110001111 1001101000 7 |
2. | 1111011011 1101011111 2 |
3. | 1011001111 1110110111 5 |
4. | 0000111011 1011010100 8 |
5. | 1011011011 1111100110 6 |
|----------------------------------|
6. | 0011111100 0100011110 5 |
7. | 0011011011 0011111010 2 |
8. | 1010100011 1011000100 5 |
9. | 1110011011 1010010100 5 |
10. | 1001011111 0100111001 6 |
|----------------------------------|
11. | 1111111111 1111111111 0 |
12. | 1111111111 0000000000 10 |
+----------------------------------+
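Here is a sketch of how the same loop might be applied to data shaped like the question's. The variable names (type_num, Subject, round_num, Input) are taken from the question; the data values beyond the two strings quoted there are made up. Taking the lag under bysort keeps each previous input within the same type and subject.

```stata
clear
input type_num Subject round_num str10 Input
1 1 1 "0000011111"
1 1 2 "0000011110"
1 1 3 "1000011110"
1 2 1 "1111100000"
1 2 2 "1111100000"
end

* previous input within each type and subject; round 1 starts from all zeros
bysort type_num Subject (round_num): gen input_prev = Input[_n-1]
replace input_prev = "0000000000" if round_num == 1

* Hamming distance: count the positions at which the two strings differ
gen input_change = 0
quietly forval k = 1/10 {
    replace input_change = input_change + (substr(Input, `k', 1) != substr(input_prev, `k', 1))
}
list, sepby(Subject)
```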

Browse all the rows and columns that contain a zero

Suppose I have 100 variables named ID, var1, var2, ..., var99. I have 1000 rows. I want to browse all the rows and columns that contain a 0.
I wanted to just do this:
browse ID, var* if var* == 0
but it doesn't work. I don't want to hardcode all 99 variables obviously.
I wanted to essentially write an if like this:
gen has0 = 0
forvalues n = 1/99 {
if var`n' does not contain 0 {
drop v
} // pseudocode I know doesn't work
has0 = has0 | var`n' == 0
}
browse if has0 == 1
but obviously that doesn't work.
Do I just need to reshape the data so it has 2 columns ID, var with 100,000 rows total?
My dear colleague @NickCox forces me to reply to this (duplicate) question because he is claiming that downloading, installing and running a new command is better than using built-in ones when you "need to select from 99 variables".
Consider the following toy example:
clear
input var1 var2 var3 var4 var5
1 4 9 5 0
1 8 6 3 7
0 6 5 6 8
4 5 1 8 3
2 1 0 2 1
4 6 7 1 9
end
list
+----------------------------------+
| var1 var2 var3 var4 var5 |
|----------------------------------|
1. | 1 4 9 5 0 |
2. | 1 8 6 3 7 |
3. | 0 6 5 6 8 |
4. | 4 5 1 8 3 |
5. | 2 1 0 2 1 |
6. | 4 6 7 1 9 |
+----------------------------------+
Actually you don't have to download anything:
preserve
generate obsno = _n
reshape long var, i(obsno)
rename var value
generate var = "var" + string(_j)
list var obsno value if value == 0, noobs
+----------------------+
| var obsno value |
|----------------------|
| var5 1 0 |
| var1 3 0 |
| var3 5 0 |
+----------------------+
levelsof var if value == 0, local(selectedvars) clean
display "`selectedvars'"
var1 var3 var5
restore
This is the approach I recommended in the linked question for identifying negative values. With levelsof, a built-in command, one can do the same thing as with findname.
This solution can also be adapted for browse:
preserve
generate obsno = _n
reshape long var, i(obsno)
rename var value
generate var = "var" + string(_j)
browse var obsno value if value == 0
levelsof var if value == 0, local(selectedvars) clean
display "`selectedvars'"
pause
restore
Although I do not see why one would want to browse the results when one can simply list them.
EDIT:
Here's an example more closely resembling the OP's dataset:
clear
set seed 12345
set obs 1000
generate id = int((_n - 1) / 300) + 1
forvalues i = 1 / 100 {
generate var`i' = rnormal(0, 150)
}
ds var*
foreach var in `r(varlist)' {
generate rr = runiform()
replace `var' = 0 if rr < 0.0001
drop rr
}
Applying the above solution yields:
display "`selectedvars'"
var13 var19 var35 var36 var42 var86 var88 var90
list id var obsno value if value == 0, noobs sepby(id)
+----------------------------+
| id var obsno value |
|----------------------------|
| 1 var86 18 0 |
| 1 var19 167 0 |
| 1 var13 226 0 |
|----------------------------|
| 2 var88 351 0 |
| 2 var36 361 0 |
| 2 var35 401 0 |
|----------------------------|
| 3 var42 628 0 |
| 3 var90 643 0 |
+----------------------------+
Short answer: wildcards for bunches of variables can't be inserted in if qualifiers. (The if command is different from the if qualifier.)
Your question is contradictory on what you want. At one point your pseudocode has you dropping variables! drop has a clear, destructive meaning to Stata programmers: it doesn't mean "ignore".
But let's stick to the emphasis on browse.
findname, any(# == 0)
finds variables for which any value is 0. Use search findname, sj to find the latest downloadable version.
Note also that
findname, type(numeric)
will return the numeric variables in r(varlist) (and also a local macro if you so specify).
Then several egen functions compete for finding 0s in each observation for a specified varlist: the command findname evidently helps you identify which varlist.
Let's create a small sandbox to show technique:
clear
set obs 5
gen ID = _n
forval j = 1/5 {
gen var`j' = 1
}
replace var2 = 0 in 2
replace var3 = 0 in 3
list
findname var*, any(# == 0) local(which)
egen zero = anymatch(`which'), value(0)
list `which' if zero
+-------------+
| var2 var3 |
|-------------|
2. | 0 1 |
3. | 1 0 |
+-------------+
So, the problem is split into two: finding the variables with any zeros and finding the observations with any zeros, and then putting the information together.
Naturally, the use of findname is dispensable as you can just write your own loop to identify the variables of interest:
local wanted
quietly foreach v of var var* {
count if `v' == 0
if r(N) > 0 local wanted `wanted' `v'
}
Equally naturally, you can browse as well as list: the difference is just in the command name.
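Putting the pieces together without findname, a minimal sketch on the same sandbox might look like this. The names wanted and zero are the ones used above; nothing here needs to be installed.

```stata
clear
set obs 5
gen ID = _n
forval j = 1/5 {
    gen var`j' = 1
}
replace var2 = 0 in 2
replace var3 = 0 in 3

* step 1: which variables contain any zero?
local wanted
quietly foreach v of varlist var* {
    count if `v' == 0
    if r(N) > 0 local wanted `wanted' `v'
}

* step 2: which observations contain any zero in those variables?
egen zero = anymatch(`wanted'), values(0)

* step 3: show only those rows and columns (swap list for browse if preferred)
list ID `wanted' if zero
```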

Generating dummies in Stata

I have a dataset in Stata of the following form
id | year
a | 1950
b | 1950
c | 1950
d | 1950
.
.
.
y | 1950
-----
a | 1951
b | 1951
c | 1951
d | 1951
.
.
.
y | 1951
-----
...
I'm looking for a quick way to rewrite the following code
gen dummya=1 if id=="a"
gen dummyb=1 if id=="b"
gen dummyc=1 if id=="c"
...
gen dummyy=1 if id=="y"
and
gen dummy50=1 if year==1950
gen dummy51=1 if year==1951
...
Note that all your dummies would be created as 1 or missing. It is almost always more useful to create them directly as 1 or 0. Indeed, that is the usual definition of dummies.
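A small sketch of the difference, using the auto data because the question's data are not available. The variable names bad and good are made up for the illustration.

```stata
sysuse auto, clear

* the pattern in the question: 1 if true, missing otherwise
gen bad = 1 if foreign == 1

* a true-or-false expression evaluates to 1 or 0 directly
gen byte good = foreign == 1

count if missing(bad)    // the domestic cars end up missing
count if good == 0       // the same cars are coded 0 here
```

If the variable being tested could itself be missing, add an if !missing() qualifier so that missing values do not end up coded as 0.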
In general, it's a loop over the possibilities using forvalues or foreach, but the shortcut is too easy not to be preferred in this case. Consider this reproducible example:
. sysuse auto, clear
(1978 Automobile Data)
. tab rep78, gen(rep78)
     Repair |
Record 1978 |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          2        2.90        2.90
          2 |          8       11.59       14.49
          3 |         30       43.48       57.97
          4 |         18       26.09       84.06
          5 |         11       15.94      100.00
------------+-----------------------------------
      Total |         69      100.00
. d rep78?
              storage   display    value
variable name   type    format     label      variable label
------------------------------------------------------------------------------
rep781          byte    %8.0g                 rep78== 1.0000
rep782          byte    %8.0g                 rep78== 2.0000
rep783          byte    %8.0g                 rep78== 3.0000
rep784          byte    %8.0g                 rep78== 4.0000
rep785          byte    %8.0g                 rep78== 5.0000
That's all the dummies (some prefer to say "indicators") in one fell swoop through an option of tabulate.
For completeness, consider an example doing it the loop way. We imagine that years 1950-2015 are represented:
forval y = 1950/2015 {
gen byte dummy`y' = year == `y'
}
Two-digit identifiers dummy50 to dummy15 would be unambiguous in this example, so here they are as a bonus:
forval y = 1950/2015 {
local Y : di %02.0f mod(`y', 100)
gen byte dummy`Y' = year == `y'
}
Here byte is dispensable unless memory is very short, but it's good practice anyway.
If anyone was determined to write a loop to create indicators for the distinct values of a string variable, that can be done too. Here are two possibilities. Absent an easily reproducible example in the original post, let's create a sandbox. The first method is to encode first, then loop over distinct numeric values. The second method is to find the distinct string values directly and then loop over them.
clear
set obs 3
gen mystring = word("frog toad newt", _n)
* Method 1
encode mystring, gen(mynumber)
su mynumber, meanonly
forval j = 1/`r(max)' {
gen dummy`j' = mynumber == `j'
label var dummy`j' "mystring == `: label (mynumber) `j''"
}
* Method 2
levelsof mystring
local j = 1
foreach level in `r(levels)' {
gen dummy2`j' = mystring == `"`level'"'
label var dummy2`j' `"mystring == `level'"'
local ++j
}
describe
Contains data
  obs:             3
 vars:             8
 size:            96
------------------------------------------------------------------------------
              storage   display    value
variable name   type    format     label      variable label
------------------------------------------------------------------------------
mystring        str4    %9s
mynumber        long    %8.0g      mynumber
dummy1          float   %9.0g                 mystring == frog
dummy2          float   %9.0g                 mystring == newt
dummy3          float   %9.0g                 mystring == toad
dummy21         float   %9.0g                 mystring == frog
dummy22         float   %9.0g                 mystring == newt
dummy23         float   %9.0g                 mystring == toad
------------------------------------------------------------------------------
Sorted by:
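Use factor-variable notation, i.<variable_name>, on the categorical variable itself rather than creating the dummies by hand.
For example, in your case, you can use the following command for regression: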
reg y i.year
I also recommend using
egen year_dum = group(year)
reg y i.year_dum
This can be generalized arbitrarily, and you can easily create, e.g., year-by-state fixed effects this way:
egen year_state_dum = group(year state)
reg y i.year_state_dum

Row-wise count/sum of values in Stata

I have a dataset where each person (row) has values 0, 1 or . in a number of variables (columns).
I would like to create two variables. One that includes the count of all the 0 and one that has the count of all the 1 for each person (row).
In my case, there is no pattern in the variable names. For this reason I create a varlist of all the existing variables, excluding the ones that should not be counted.
+--------+--------+------+------+------+------+------+----------+--------+
| ID | region | Qa | Qb | C3 | C4 | Wa | count 0 | count 1|
+--------+--------+------+------+------+------+------+----------+--------+
| 1 | A | 1 | 1 | 1 | 1 | . | 0 | 4 |
| 2 | B | 0 | 0 | 0 | 1 | 1 | 3 | 2 |
| 3 | C | 0 | 0 | . | 0 | 0 | 4 | 0 |
| 4 | D | 1 | 1 | 1 | 1 | 0 | 1 | 4 |
+--------+--------+------+------+------+------+------+----------+--------+
The following works; however, I cannot add an if condition:
ds ID region, not // all variables in the dataset apart from ID region
return list
local varlist = r(varlist)
egen count_of_1s = rowtotal(`varlist')
If I replace the last line with the one below, I get an invalid syntax error.
egen count_of_1s = rowtotal(`varlist') if `v' == 1
I turned from count to summing because I thought this is a sneaky way out of the problem. I could change the values from 0,1 to 1, 2, then sum all the two values separately in two different variables and then divide accordingly in order to get the actual count of 1 or 2 per row.
I found this question, "Stata: Using egen, anycount() when values vary for each observation"; however, Stata freezes because my dataset is quite large (100,000 rows and 3,000 columns).
Any help will be very appreciated :-)
Solution based on the response of William
* number of total valid responses (0s and 1s, excluding . )
ds ID region, not // all variables in the dataset apart from ID region
return list
local varlist = r(varlist)
egen count_of_nonmiss = rownonmiss(`varlist') // this counts all the 0s and 1s (namely, the non missing values)
* total numbers of 1s per row
ds ID region count_of_nonmiss, not // CAUTION: count_of_nonmiss must be excluded here, or it would be added into the total
return list
local varlist = r(varlist)
egen count_of_1s = rowtotal(`varlist') // rowtotal() is an egen function, so egen is needed, not generate
How about
egen count_of_nonmiss = rownonmiss(`varlist')
generate count_of_0s = count_of_nonmiss - count_of_1s
When the value of the macro varlist is substituted into your if clause, the command expands to
egen count_of_1s = rowtotal(Qa Qb C3 C4 Wa) if Qa Qb C3 C4 Wa == 1
Clearly a syntax error.
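A minimal runnable sketch of the whole approach, using the small table from the question as data. The macro name vars is made up here. Because the macro is filled once, before any new variable is created, the later egen calls do not pick up the count variables by accident.

```stata
clear
input ID str1 region Qa Qb C3 C4 Wa
1 "A" 1 1 1 1 .
2 "B" 0 0 0 1 1
3 "C" 0 0 . 0 0
4 "D" 1 1 1 1 0
end

* every variable except the identifiers
ds ID region, not
local vars `r(varlist)'

* with 0/1 data the row total is the number of 1s
egen count_of_1s = rowtotal(`vars')

* nonmissing values minus the 1s leaves the 0s
egen count_of_nonmiss = rownonmiss(`vars')
gen count_of_0s = count_of_nonmiss - count_of_1s

list
```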
I had the same problem: counting the occurrences of specified values in each observation across a set of variables.
I could resolve that problem in the following way. If you want to count the occurrences of 0 in the values across x1-x3:
clear
input id x1 x2 x3
1 1 0 2
2 2 0 2
3 2 0 3
end
* count how many of x1-x3 equal 0 in each observation
egen count2 = anycount(x1-x3), value(0)