How to get unique values for all variables in a dataset - stata

I'm using Stata. I have a dataset with approximately 1800 observations and 1050 variables. Most of them are categorical variables with a few categories. It looks something like this:
------------------------------------------------------
| id | fh_1 | fh_1a | fh_2 | fh_2a | fh_3 | fh_3a |...
------------------------------------------------------
|1111| 1 |closed | 2 | 4 | 1 | open |...
------------------------------------------------------
|1112| 2 | open | 1 | 2 | 3 | closed|...
------------------------------------------------------
.
.
.
I need to export to an Excel sheet the list of all variables in this dataset with all unique values for each variable. It should look something like this:
--------------------------
|variable | unique_values|
--------------------------
| fh | 1 2 3 4 5 |
--------------------------
|fh_1a | closed open |
--------------------------
.
.
.
I think I need a loop with the command levelsof but I'm not sure how to build it. Any suggestions?

foreach v of var * {
levelsof `v'
}
would be a start, but I haven't directly addressed how to make that output Excel-friendly.
One possibility is to put all the output in string variables given that the number of observations exceeds the number of variables.
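* Capture the variable list first, so the two new variables are not looped over
ds
local vars `r(varlist)'
gen varname = ""
gen levels = ""
local i = 1
foreach v of local vars {
levelsof `v'
replace varname = "`v'" in `i'
replace levels = `"`r(levels)'"' in `i'
local ++i
}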
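To make that output Excel-friendly, one option is to export only the two new string variables. What follows is a minimal sketch rather than a definitive solution: the toy data, the file name unique_values.xlsx and the decision to skip id are assumptions for illustration, and export excel needs Stata 12 or later.

```stata
* Toy data standing in for the real dataset (assumed values)
clear
input id fh_1 str6 fh_1a
1111 1 "closed"
1112 2 "open"
1113 1 "open"
end

* List the variables to summarise before creating any new ones
ds id, not
local vars `r(varlist)'
local nvars : word count `vars'

gen varname = ""
gen levels = ""
local i = 1
foreach v of local vars {
    * clean strips the quotes that levelsof puts around string levels
    quietly levelsof `v', local(lv) clean
    replace varname = "`v'" in `i'
    replace levels = "`lv'" in `i'
    local ++i
}

* Export only the rows that were filled in; the file name is hypothetical
export excel varname levels in 1/`nvars' using "unique_values.xlsx", firstrow(variables) replace
```

The in 1/`nvars' qualifier restricts the export to the rows that were filled in, which is why this approach needs at least as many observations as variables.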

Here is one way to solve it. You might run into issues if you have string variables where some observations hold values made up of more than one word: in the exported list there is then no way to tell whether two adjacent words came from one observation containing both words or from two observations containing one word each.
The values are sorted alphabetically, so you might be able to work it out anyway, but the result can be ambiguous.
sysuse auto,clear
* Get a list of all vars apart from whatever var we do not want to include
ds make, not
local all_vars_but_id `r(varlist)'
* Get the number of vars, represents the number of rows in the dataset to be exported
local num_vars : word count `all_vars_but_id'
* Get the values for each var and store in local with same name as var
foreach var of local all_vars_but_id {
levelsof `var'
local `var' `r(levels)'
}
*Preserve the original data
preserve
* Remove the data and set up the data set to be exported
clear
set obs `num_vars'
gen var = ""
gen values = ""
* Copy the value of the locals created above to one row per variable
local counter 1
foreach var of local all_vars_but_id {
replace var = "`var'" if _n == `counter'
replace values = "``var''" if _n == `counter'
local counter = `counter' + 1
}
* Export to Excel
export excel using "C:/path/to/file/unique_values.xls"
*Restore the original data
restore
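The multi-word ambiguity described in this answer can be checked on a toy variable. This is a small sketch under assumed data (the variable status and its three values are made up). By default levelsof wraps each string level in compound double quotes, so the levels stay distinguishable; the clean option removes the quotes and the boundaries with them.

```stata
* Toy string variable with one two-word value (assumed data)
clear
input str12 status
"open"
"half open"
"closed"
end

* Default: each level is wrapped in compound double quotes
levelsof status, local(quoted)
* clean: quotes removed, so "half open" looks like two separate levels
levelsof status, local(cleaned) clean

* Count the words Stata sees in each macro
local nquoted : word count `quoted'
local ncleaned : word count `cleaned'
display "levels with quotes kept: `nquoted'"
display "words after clean:       `ncleaned'"
```

If multi-word values are possible, keeping the quotes or choosing another separator before exporting avoids the ambiguity.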

Another option using levelsof
input id str6(var1 var2 var3)
1 "open" "2" "3"
2 "closed" "1" "2"
3 "open" "1" "1"
end
reshape long var, i(id)
rename var values
rename _j var
gen unique_values = ""
forvalues i = 1/3 {
levelsof values if var == `i'
replace unique_values = r(levels) if var == `i'
}
replace unique_values = subinstr(unique_values,"`","",.)
replace unique_values = subinstr(unique_values,`"""',"",.)
replace unique_values = subinstr(unique_values,"'","",.)
contract var unique_values
drop _freq
list, noobs

Related

Browse all the rows and columns that contain a zero

Suppose I have 100 variables named ID, var1, var2, ..., var99. I have 1000 rows. I want to browse all the rows and columns that contain a 0.
I wanted to just do this:
browse ID, var* if var* == 0
but it doesn't work. I don't want to hardcode all 99 variables obviously.
I wanted to essentially write an if like this:
gen has0 = 0
forvalues n = 1/99 {
if var`n' does not contain 0 {
drop v
} // pseudocode I know doesn't work
has0 = has0 | var`n' == 0
}
browse if has0 == 1
but obviously that doesn't work.
Do I just need to reshape the data so it has 2 columns ID, var with 100,000 rows total?
My dear colleague @NickCox forces me to reply to this (duplicate) question because he claims that downloading, installing and running a new command is better than using built-in ones when you "need to select from 99 variables".
Consider the following toy example:
clear
input var1 var2 var3 var4 var5
1 4 9 5 0
1 8 6 3 7
0 6 5 6 8
4 5 1 8 3
2 1 0 2 1
4 6 7 1 9
end
list
+----------------------------------+
| var1 var2 var3 var4 var5 |
|----------------------------------|
1. | 1 4 9 5 0 |
2. | 1 8 6 3 7 |
3. | 0 6 5 6 8 |
4. | 4 5 1 8 3 |
5. | 2 1 0 2 1 |
6. | 4 6 7 1 9 |
+----------------------------------+
Actually you don't have to download anything:
preserve
generate obsno = _n
reshape long var, i(obsno)
rename var value
generate var = "var" + string(_j)
list var obsno value if value == 0, noobs
+----------------------+
| var obsno value |
|----------------------|
| var5 1 0 |
| var1 3 0 |
| var3 5 0 |
+----------------------+
levelsof var if value == 0, local(selectedvars) clean
display "`selectedvars'"
var1 var3 var5
restore
This is the approach I recommended in the linked question for identifying negative values. Using levelsof, one can do the same thing as findname, but with a built-in command.
This solution can also be adapted for browse:
preserve
generate obsno = _n
reshape long var, i(obsno)
rename var value
generate var = "var" + string(_j)
browse var obsno value if value == 0
levelsof var if value == 0, local(selectedvars) clean
display "`selectedvars'"
pause
restore
Although I do not see why one would want to browse the results when one can simply list them.
EDIT:
Here's an example more closely resembling the OP's dataset:
clear
set seed 12345
set obs 1000
generate id = int((_n - 1) / 300) + 1
forvalues i = 1 / 100 {
generate var`i' = rnormal(0, 150)
}
ds var*
foreach var in `r(varlist)' {
generate rr = runiform()
replace `var' = 0 if rr < 0.0001
drop rr
}
Applying the above solution yields:
display "`selectedvars'"
var13 var19 var35 var36 var42 var86 var88 var90
list id var obsno value if value == 0, noobs sepby(id)
+----------------------------+
| id var obsno value |
|----------------------------|
| 1 var86 18 0 |
| 1 var19 167 0 |
| 1 var13 226 0 |
|----------------------------|
| 2 var88 351 0 |
| 2 var36 361 0 |
| 2 var35 401 0 |
|----------------------------|
| 3 var42 628 0 |
| 3 var90 643 0 |
+----------------------------+
Short answer: wildcards for bunches of variables can't be inserted in if qualifiers. (The if command is different from the if qualifier.)
Your question is contradictory on what you want. At one point your pseudocode has you dropping variables! drop has a clear, destructive meaning to Stata programmers: it doesn't mean "ignore".
But let's stick to the emphasis on browse.
findname, any(# == 0)
finds variables for which any value is 0. search findname, sj to find the latest downloadable version.
Note also that
findname, type(numeric)
will return the numeric variables in r(varlist) (and also a local macro if you so specify).
Then several egen functions compete for finding 0s in each observation for a specified varlist: the command findname evidently helps you identify which varlist.
Let's create a small sandbox to show technique:
clear
set obs 5
gen ID = _n
forval j = 1/5 {
gen var`j' = 1
}
replace var2 = 0 in 2
replace var3 = 0 in 3
list
findname var*, any(# == 0) local(which)
egen zero = anymatch(`which'), value(0)
list `which' if zero
+-------------+
| var2 var3 |
|-------------|
2. | 0 1 |
3. | 1 0 |
+-------------+
So, the problem is split into two: finding the variables with any zeros and finding the observations with any zeros, and then putting the information together.
Naturally, the use of findname is dispensable as you can just write your own loop to identify the variables of interest:
local wanted
quietly foreach v of var var* {
count if `v' == 0
if r(N) > 0 local wanted `wanted' `v'
}
Equally naturally, you can browse as well as list: the difference is just in the command name.
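Putting the two halves together without findname might look like the sketch below. It reuses the sandbox from this answer and uses only built-in commands; the names wanted and zero are just illustrative choices.

```stata
* Same sandbox as above
clear
set obs 5
gen ID = _n
forval j = 1/5 {
    gen var`j' = 1
}
replace var2 = 0 in 2
replace var3 = 0 in 3

* Step 1: collect the variables that contain at least one zero
local wanted
quietly foreach v of var var* {
    count if `v' == 0
    if r(N) > 0 local wanted `wanted' `v'
}

* Step 2: flag the observations with a zero in any of those variables
egen zero = anymatch(`wanted'), values(0)

* Step 3: show only the relevant rows and columns
list ID `wanted' if zero
```

If no variable contains a zero, the local wanted is empty and the egen call fails, so a real script would check for that case first.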

Add and name column in a dataset

I'm trying in Stata to add columns to my dataset and name them year_2005,..., year_2017.
Here is my code:
gen a=.
forvalues i=2005(1)2015 {
replace a=(b>i)
rename a "year"+`i'
}
b is a numeric variable in my dataset.
Here's one way to do this:
clear
set obs 1
forvalues i = 1 / 15 {
local d
if `i' < 10 local d 0
generate year_20`d'`i' = runiform()
}
Or alternatively (as per @NickCox's comment; see Stata tip 85):
clear
set obs 1
forvalues i = 1 / 15 {
generate year_20`: display %02.0f `i'' = runiform()
}
Or using your example:
clear
set obs 1
forvalues i = 2005(1)2015 {
generate a = .
replace a = runiform()
rename a year_`i'
}
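If the aim of the code in the question is an indicator for whether b exceeds each year, the comparison can go straight into generate, with no rename needed. This is a sketch on made-up values of b, and it assumes that reading of the question is the intended one.

```stata
* Made-up data: b takes the values 2005, 2010 and 2015
clear
set obs 3
gen b = 2000 + 5 * _n

* One indicator per year: 1 if b is greater than that year, 0 otherwise
forvalues i = 2005/2015 {
    gen year_`i' = (b > `i')
}

list b year_2005 year_2010 year_2015
```

Stata treats missing values as larger than any number, so adding if !missing(b) to the generate line keeps missing values of b from being coded as 1.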

SAS: Rename variables in merge according to original dataset

I have two datasets, one for male and one for female, which contain identical variables. I need to find the percent difference between the sexes on each variable by group.
The datasets look something like this, but with more variables and groups,
| Group | Sex | VarA | VarB |
|-------+-----+------+------|
| 1 | F | 8 | 5 |
| 2 | F | 6 | 3 |
| 3 | F | 7 | 0 |
|-------+-----+------+------|
| Group | Sex | VarA | VarB |
|-------+-----+------+------|
| 1 | M | 9 | 7 |
| 2 | M | 8 | 5 |
| 3 | M | 6 | 3 |
|-------+-----+------+------|
The result I need is this:
| Group | percent_diffA | percent_diffB |
|-------+---------------+---------------|
| 1 | -0.117647059 | -0.333333333 |
| 2 | -0.285714286 | -0.5 |
| 3 | 0.153846154 | -2 |
|-------+---------------+---------------|
I could solve this via a merge by renaming each variable.
data difference;
merge
females (rename = (VarA = VarA_F VarB = VarB_F))
males (rename = (VarA = VarA_M VarB = VarB_M))
;
by group;
percent_diffA = (VarA_F - VarA_M) / ( (VarA_F + VarA_M) / 2 );
percent_diffB = (VarB_F - VarB_M) / ( (VarB_F + VarB_M) / 2 );
drop sex;
run;
However, this approach requires me to rename everything manually. With several variables, the rename statement becomes cumbersome. Unfortunately, this calculation is being interjected into some old code, so renaming the original datasets is not practical.
I'm wondering if there is another way to solve this problem which is less cumbersome.
EDIT: I have updated the variable names because that appears to have caused people confusion. They were originally called Var1 and Var2. They are now VarA and VarB. The real variable names are descriptive, for instance body_weight_g or gonadal_somatic_index. The variables are not simply listed with sequential numbers.
For a data set whose variables are sequentially numbered, variable list syntax lets you rename the whole range at once.
This example creates sample data with 100 variables:
data have1 have2;
do group = 1 to 100;
sex = 'M';
array var(100);
do _n_ = 1 to dim(var);
var(_n_) = ceil (25 * ranuni(123));
end;
if group ne 42 then output have1;
sex = 'F';
do _n_ = 1 to dim(var);
var(_n_) = ceil (25 * ranuni(123));
end;
if group ne 100-42 then output have2;
end;
run;
The rename option works on all 100 variables.
data want;
merge
have1(rename=(var1-var100=mvar1-mvar100) in=_M)
have2(rename=(var1-var100=fvar1-fvar100) in=_F)
;
by group;
if _M & _F & first.group & last.group then do;
array one mvar1-mvar100;
array two fvar1-fvar100;
array results result1-result100;
do i = 1 to dim(results);
diff = one(i) - two(i);
mean = mean (one(i), two(i));
results(i) = diff / mean * 100;
end;
end;
keep group result:;
run;
Shenglin's answer is a nice and concise use of SQL.
An alternative method is constructing a macro variable specifying the renames to be used in the rename DSO (data set option). This can be done with an SQL query to the dictionary table containing the column names.
* This macro creates the macro variable rename_suffix, to be used in a rename statement or data set option ;
* It will be of form: var1 = var1_suffix var2 = var2_suffix ... ;
* &inset is the input set. &suffix is the suffix to be added to all variables except for the variables specified in &keys. ;
* &keys variables should be given each in quotation marks, and separated by spaces. ;
%macro rename_list(inset, suffix, keys) ;
%global rename_&inset ; * So that this macro variable is accessible outside the macro ;
proc sql ;
select strip(name) || ' = ' || strip(name) || "_&suffix"
into :rename_&inset separated by ' '
from sashelp.vcolumn /* dictionary.columns can be used in place of sashelp.vcolumn */
where libname = 'WORK' & memname = "%sysfunc(upcase(&inset))"
& upcase(strip(name)) not in (' ' %sysfunc(upcase(&keys))); * The ' ' is included, so there is no error if no keys are given ;
quit ;
%mend rename_list ;
%rename_list(females, F, 'GROUP' 'SEX')
%rename_list(males , M, 'GROUP' 'SEX')
%put &rename_females ; * Check that the macro variables are correct ;
%put &rename_males ;
%macro pct_diff(num) ;
percent_diff&num = (Var&num._F - Var&num._M) / ( (Var&num._F + Var&num._M) / 2 ) ;
%mend pct_diff ;
data difference ;
merge females(rename = (&rename_females), drop = sex)
males (rename = (&rename_males ), drop = sex) ;
by group ;
%pct_diff(1) ;
%pct_diff(2) ;
run ;
dm 'vt difference';
The percent_diff variable creation can also be shortened with a macro (as shown). If you had a large and/or variable number of variables to compare, then you could further shorten it by automatically detecting the number of comparisons, by running the same SQL query with the select into part modified to be
select count(name) into :varct trimmed
to count the number of variables, and then use a %do loop, which has to sit inside a macro definition, to generate one statement per variable in the data step:
%do i = 1 %to &varct ;
%pct_diff(&i)
%end ;
Use table alias in proc sql to avoid name change:
proc sql;
select a.group,(a.var1-b.var1)/((a.var1+b.var1)/2) as percent_diff1,
(a.var2-b.var2)/((a.var2+b.var2)/2) as percent_diff2
from females as a,males as b
where a.group=b.group;
quit;

Sort letters in a string variable in alphabetical order

I need to sort the letters of a string variable into alphabetical order in Stata. Can someone suggest a command or a method to do it?
For example: I have a string variable with 1000 observations. So the method would sort the characters (letters) like this:
School--chloos
sort--orst
akramabad-dabamarka
For a dataset that size, the easiest way is possibly just to expand data briefly to a version with each character in a separate observation. Your question leaves open your rules on lower and upper case, but I'll take your example "School" to "chloos" literally as implying working with lower case.
clear
input str9 sandbox
"School"
"sort"
"akramabad"
end
gen length = length(sandbox)
gen id = _n
expand length
bysort id : gen char = substr(lower(sandbox), _n, 1)
sort id char
bysort id (char) : gen newbox = char[1]
by id: replace newbox = newbox[_n-1] + char if _n > 1
by id: replace newbox = newbox[_N]
by id: keep if _n == 1
drop length char
list
+----------------------------+
| sandbox id newbox |
|----------------------------|
1. | School 1 chloos |
2. | sort 2 orst |
3. | akramabad 3 aaaabdkmr |
+----------------------------+
Creating separate variables for each letter and sorting them within observations would also seem possible.
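A third route, different from both ideas above, avoids changing the shape of the data at all: loop over the alphabet and append each letter once for every position where it occurs. This is only a sketch. It assumes lower-case a to z is all that matters (any other character is dropped), and the names len, maxlen and ch are illustrative.

```stata
clear
input str9 sandbox
"School"
"sort"
"akramabad"
end

* Longest string determines how many positions to check
gen len = length(sandbox)
summarize len, meanonly
local maxlen = r(max)

gen newbox = ""
* c(alpha) holds the lower-case letters a to z
foreach ch in `c(alpha)' {
    forvalues p = 1/`maxlen' {
        * Append the letter once for each position where it occurs
        quietly replace newbox = newbox + "`ch'" if substr(lower(sandbox), `p', 1) == "`ch'"
    }
}
drop len
list
```

The cost is 26 passes times the maximum length, which is negligible for 1000 observations with short strings.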

Stata: Using egen, anycount() when values vary for each observation

Each observation in my data represents a player who follows some random pattern. Variables move1 and up record which player was active on each move. I need to count the number of times each player was active:
The data look as follows (with _count representing a variable that I would like to generate). The number of moves can also be different depending on simulation.
+------------+------------+-------+-------+-------+-------+-------+-------+--------+
| simulation | playerlist | move1 | move2 | move3 | move4 | move5 | move6 | _count |
+------------+------------+-------+-------+-------+-------+-------+-------+--------+
| 1 | 1 | 1 | 1 | 1 | 2 | . | . | 3 |
| 1 | 2 | 2 | 2 | 4 | 4 | . | . | 2 |
| 2 | 3 | 1 | 2 | 3 | 3 | 3 | 3 | 4 |
| 2 | 4 | 4 | 1 | 2 | 3 | 3 | 3 | 1 |
+------------+------------+-------+-------+-------+-------+-------+-------+--------+
egen combined with anycount() is not applicable in this case because the argument for the value() option is not a constant integer.
I have made an attempt to cycle through each observation and use egen rowwise (see below) but it keeps count as missing (as initialised) and is not very efficient (I have 50,000 observations). Is there a way to do this in Stata?
gen _count =.
quietly forval i = 1/`=_N' {
egen temp = anycount(move*), values( `=`playerlist'[`i']')
replace _count = temp
drop temp
}
You can easily cut out the loop over observations. In addition, egen is only to be used for convenience, never speed.
gen _count = 0
quietly forval j = 1/6 {
replace _count = _count + (move`j' == playerlist)
}
or
gen _count = move1 == playerlist
quietly forval j = 2/6 {
replace _count = _count + (move`j' == playerlist)
}
Even if you had been determined to use egen, the loop need only be over the distinct values of playerlist, not all the observations. Say the maximum is 42:
gen _count = 0
quietly forval k = 1/42 {
egen temp = anycount(move*), value(`k')
replace _count = temp if playerlist == `k'
drop temp
}
But that's still a lousy method for your problem. (I wrote the original of anycount() so I can say why it was written.)
See also http://www.stata-journal.com/sjpdf.html?articlenum=pr0046 for a review of working rowwise.
P.S. Your code contains bugs.
You replace your count variable in all observations by the last value calculated for the count in the last observation.
Values are compared with a local macro playerlist. You presumably have no local macro of that name, so the macro is evaluated as empty. The result is that you end by comparing each value of your move* variables with the observation numbers. You meant to use the variable name playerlist, but the single quotation marks force the macro interpretation.
For the record, this fixes both bugs:
gen _count = .
quietly forval i = 1/`=_N' {
egen temp = anycount(move*), values(`= playerlist[`i']')
replace _count = temp in `i'
drop temp
}
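For a check against the table in the question, the loop-over-variables approach recommended above can be run on those four rows. This is a sketch: the result variable is called n_active here purely for illustration, and foreach over move* is used so that the number of move variables does not have to be hard-coded.

```stata
* The four rows shown in the question
clear
input simulation playerlist move1 move2 move3 move4 move5 move6
1 1 1 1 1 2 . .
1 2 2 2 4 4 . .
2 3 1 2 3 3 3 3
2 4 4 1 2 3 3 3
end

* Add 1 for every move variable that equals the player's own number
gen n_active = 0
foreach v of varlist move* {
    quietly replace n_active = n_active + (`v' == playerlist)
}

list simulation playerlist n_active
```

Missing moves never equal a nonmissing playerlist, so they add nothing to the count.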