Combine several rows into one observation - stata

I have a dataset in Stata where one observation is spread out over multiple rows like the table below. The variables are string except for the id, and there exist some duplicate entries for some variables (like the last row in the table).
id   var1    var2    var3
 1   name1
 1           name2
 1                   name3
 2   name4
 2           name5
 3   name6
 3                   name8
 3                   name9
I want to take the first value and combine all variables to one row / observation. I think this is a really easy task but somehow I don't manage to figure it out.
id   var1    var2    var3
 1   name1   name2   name3
 2   name4   name5
 3   name6           name8

This looks like a job for collapse.
collapse (firstnm) var*, by(id)
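As a minimal sketch of how this plays out, assume the question's data are entered as below, with empty strings in the blank cells (the exact placement of the values across var1 to var3 is an assumption). With (firstnm), each variable should keep its first non-empty value within id, leaving one row per id.

```stata
clear
input byte id str5(var1 var2 var3)
1 "name1" ""      ""
1 ""      "name2" ""
1 ""      ""      "name3"
2 "name4" ""      ""
2 ""      "name5" ""
3 "name6" ""      ""
3 ""      ""      "name8"
3 ""      ""      "name9"
end

* keep the first non-missing (non-empty) value of each variable within id
collapse (firstnm) var*, by(id)
list
```

Note that collapse replaces the data in memory, so preserve first if the original rows are still needed.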

I am going to assume as implied in text that name9 is really the same as name8. That being so, here is one solution.
* Example generated by -dataex-. For more info, type help dataex
clear
input byte id str5(var1 var2 var3)
1 "name1" "" ""
1 "" "name2" ""
1 "" "" "name3"
2 "name4" "" ""
2 "" "name5" ""
3 "name6" "" ""
3 "" "" "name8"
3 "" "" "name8"
end
forval j = 1/3 {
    bysort id (var`j') : replace var`j' = var`j'[_N]
}
duplicates drop
list
     +----------------------------+
     | id    var1    var2    var3 |
     |----------------------------|
  1. |  1   name1   name2   name3 |
  2. |  2   name4   name5         |
  3. |  3   name6           name8 |
     +----------------------------+
EDIT: In the event that what is wanted is the first non-missing string value, collapse remains the solution of choice, but here is a solution without it.
clear
input byte id str5(var1 var2 var3)
1 "name1" "" ""
1 "" "name2" ""
1 "" "" "name3"
2 "name4" "" ""
2 "" "name5" ""
3 "name6" "" ""
3 "" "" "name8"
3 "" "" "name9"
end
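* Sort first so that obsno matches the observation numbers used for subscripting below
sort id, stable
gen long obsno = _n
forval j = 1/3 {
    bysort id : egen firstnm = min(cond(var`j' != "", obsno, .))
    replace var`j' = var`j'[firstnm]
    drop firstnm
}
drop obsno
duplicates drop
list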

Pick the last record in a list of variables/columns

I have a dataset in wide format in Stata and I would like to pick the last observation of each variable. In the example below, I would like to generate a new variable based on the last observation of the list of variables.
id   v1   v2   v3   new_variable
 1    1    2         2
 2    1    2    3    3
 3    1              1
 4    1    4         4
I tried the code below, but it doesn't work. My thought was to pick one variable at a time, e.g. v1==1.
gen new_variable=.
foreach v of varlist v* {
    replace new_variable=1 if `v'==1
    replace new_variable=2 if `v'==2
    replace new_variable=3 if `v'==3
}
You want the last non-missing value in each observation (row, record, case) over a series of variables (columns, fields). Terminology in your question is confused.
I first interpret the blanks in your data example as numeric missing values. That being so, what you want is given by the egen function rowlast(). It can also be obtained by looping as follows:
1. Initialise with the first variable.
2. Loop over the other variables, replacing the result whenever a variable is not missing.
* Example generated by -dataex-. For more info, type help dataex
clear
input byte(v1 v2 v3) float wanted
1 2 . 2
1 2 3 3
1 . . 1
1 4 . 4
end
egen WANTED = rowlast(v1 v2 v3)
gen wAnTeD = v1
forval j = 2/3 {
    replace wAnTeD = v`j' if !missing(v`j')
}
list
     +-----------------------------------------+
     | v1   v2   v3   wanted   WANTED   wAnTeD |
     |-----------------------------------------|
  1. |  1    2    .        2        2        2 |
  2. |  1    2    3        3        3        3 |
  3. |  1    .    .        1        1        1 |
  4. |  1    4    .        4        4        4 |
     +-----------------------------------------+
I next interpret the data as string variables. The egen solution doesn't work but the loop idea does work. Note that missing means empty strings "": spaces must be removed or ignored.
* Example generated by -dataex-. For more info, type help dataex
clear
input str1(v1 v2 v3 wanted)
"1" "2" "" "2"
"1" "2" "3" "3"
"1" "" "" "1"
"1" "4" "" "4"
end
gen WANTED = v1
forval j = 2/3 {
    replace WANTED = v`j' if !missing(v`j')
}
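The point about spaces can be sketched as follows, assuming some cells hold only blanks rather than truly empty strings (the example values are made up). strtrim() (trim() in older versions of Stata) strips leading and trailing blanks before the comparison, so such cells are treated as missing.

```stata
clear
input str2(v1 v2 v3)
"1" "2" ""
"1" "2" "3"
"1" ""  ""
"1" "4" ""
end
* make two "empty" cells consist of blanks only (assumed messy data)
replace v3 = " " in 1
replace v3 = "  " in 4

gen WANTED = v1
forval j = 2/3 {
    * ignore cells that are empty once blanks are trimmed
    replace WANTED = v`j' if strtrim(v`j') != ""
}
list
```

Using !missing(v`j') here instead would pick up the blank-only cells as if they were real values.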

How to get unique values for all variables in a dataset

I'm using Stata. I have a dataset with approximately 1800 observations and 1050 variables. Most of them are categorical variables with a few categories. It looks something like this:
------------------------------------------------------
| id | fh_1 | fh_1a | fh_2 | fh_2a | fh_3 | fh_3a |...
------------------------------------------------------
|1111| 1 |closed | 2 | 4 | 1 | open |...
------------------------------------------------------
|1112| 2 | open | 1 | 2 | 3 | closed|...
------------------------------------------------------
.
.
.
I need to export to an Excel sheet the list of all variables in this dataset with all unique values for each variable. It should look something like this:
--------------------------
|variable | unique_values|
--------------------------
| fh | 1 2 3 4 5 |
--------------------------
|fh_1a | closed open |
--------------------------
.
.
.
I think I need a loop with the command levelsof but I'm not sure how to build it. Any suggestions?
foreach v of var * {
    levelsof `v'
}
would be a start, but I haven't directly addressed how to make that output Excel-friendly.
One possibility is to put all the output in string variables given that the number of observations exceeds the number of variables.
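* Record the variable list first, so that the two new variables are not looped over
unab allvars : *
gen varname = ""
gen levels = ""
local i = 1
foreach v of local allvars {
    levelsof `v'
    replace varname = "`v'" in `i'
    replace levels = `"`r(levels)'"' in `i'
    local ++i
}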
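To make this Excel-friendly, one possibility is to export just the two new variables with export excel. The following is a rough sketch on a made-up three-variable dataset; the data and the file name unique_values.xlsx are assumptions for illustration only.

```stata
clear
input int id byte fh_1 str6 fh_1a
1111 1 "closed"
1112 2 "open"
1113 1 "open"
end

* record the variable list before adding the two helper variables
unab allvars : *
gen varname = ""
gen levels = ""
local i = 1
foreach v of local allvars {
    * clean drops the compound quotes around string levels
    quietly levelsof `v', local(lv) clean
    quietly replace varname = "`v'" in `i'
    quietly replace levels = "`lv'" in `i'
    local ++i
}

* export only the rows that were filled in, with variable names as headers
local nvars = `i' - 1
export excel varname levels using "unique_values.xlsx" in 1/`nvars', firstrow(variables) replace
```

The clean option keeps the cells readable, at the cost that string values containing spaces can no longer be told apart from separate values.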
Here is one way to solve it. You might run into issues if you have string variables in which some values consist of more than one word: there is then no way to tell whether two words came from one observation containing both or from two observations containing one word each.
The values are sorted alphabetically, so you might be able to figure it out anyway, but the result can be ambiguous.
sysuse auto,clear
* Get a list of all vars apart from whatever var we do not want to include
ds make, not
local all_vars_but_id `r(varlist)'
* Get the number of vars, represents the number of rows in the dataset to be exported
local num_vars : word count `all_vars_but_id'
* Get the values for each var and store in local with same name as var
foreach var of local all_vars_but_id {
    levelsof `var'
    local `var' `r(levels)'
}
*Preserve the original data
preserve
* Remove the data and set up the data set to be exported
clear
set obs `num_vars'
gen var = ""
gen values = ""
* Copy the value of the locals created above to one row per variable
local counter 1
foreach var of local all_vars_but_id {
    replace var = "`var'" if _n == `counter'
    replace values = "``var''" if _n == `counter'
    local counter = `counter' + 1
}
* Export to Excel
export excel using "C:\path/to/file/unique_values.xls"
*Restore the original data
restore
Another option using levelsof:
clear
input id str6(var1 var2 var3)
1 "open" "2" "3"
2 "closed" "1" "2"
3 "open" "1" "1"
end
reshape long var, i(id)
rename var values
rename _j var
gen unique_values = ""
forvalues i = 1/3 {
    levelsof values if var == `i'
    replace unique_values = r(levels) if var == `i'
}
replace unique_values = subinstr(unique_values,"`","",.)
replace unique_values = subinstr(unique_values,`"""',"",.)
replace unique_values = subinstr(unique_values,"'","",.)
contract var unique_values
drop _freq
list, noobs

Importing multiple text (txt) files into SAS (files have variable attributes in first 2 rows)

I am trying to import multiple text files into SAS. The peculiarity of the data is that the first row has the labels for some of the variables and the second row has text indicating type of some of the variables. The third row has the variable names.
I was intending to use a macro to read the files as the first 7 variables have the same names. I am not sure how to programmatically handle the variable attributes in the files. Please suggest how I could do this.
The code so far:
%macro text2sas(filenam=);
    proc import datafile="../&filenam..txt"
        out="&filenam"
        dbms=dlm replace ;
        delimiter = '09'x;
        getnames=no;
        datarow=1;
        guessingrows=max;
    run;
%mend text2sas;
%text2sas(filenam=convdat);
%text2sas(filenam=tratdat);
The data for convdat.txt looks like this:
"Dance retail:" "Dummy measurement completed successfully?" "Dramatic measurements?" "Maximal travel :" "Velocity time at start:" "Mean velocity at start:" "Maximal velocity at end:" "Velocity time iinterval:" "Mean velocity interval:" "Crain Dp:"
date string string number number number number number number number
RELAXT RAIN PLUCK RAPPLE VRAT GROSS PANGLE "Straint" "Etramp" "Crumpa" "Cafin" "Cafinat" "Cafinab" "Cafinavr" "Cafinap" "cafinal"
X5980B00099 "CF" G0001001 1234 "Vlapa1" 1 "Crt appoi" "10-May-2010" "1" "1" "" "" "" "" "" "" ""
X5980B00099 "CF" G0001002 1234 "Vlapa1" 1 "Crt appoi" "13-May-2010" "1" "1" "" "" "" "" "" "" ""
X5980B00099 "CF" G0001003 1234 "Vlapa1" 1 "Crt appoi" "19-may-2010" "1" "1" "" "" "" "" "" "" ""
X5980B00099 "CF" G0001004 1234 "Vlapa1" 1 "Crt appoi" "26-may-2010" "1" "1" "0.45" "0.55" "0.98" "0.76" "0.98" "0.12" "5.77"
Data for tratdat looks like this:
"Arbitrary carpets" "Household items" "Garage material" "Sundry data (everything else)" "Vehicle number" "Strains" "ITM" "Finals" "Dreadspan" "Printers" "Comment 1" "comment 2" "Grapple" "Drops" "Triangles"
boolean boolean boolean boolean boolean boolean boolean boolean boolean boolean string boolean boolean boolean boolean
RELAXT RAIN PLUCK RAPPLE VRAT GROSS PANGLE "Ant" "App" "Cro" "BRon" "Dramas" "Slacks" "CRAT" "Frob" "Rilo" "Ph7jj" "P10rt" "Irup" "GLk2" "Dap3" "Oreta"
X5980B00099 "GB" G0001001 1234 "Vlapa1" 1 "Pangolin train" "" "checked" "" "checked" "" "checked" "checked" "" "" "" "" "" "" "" ""
X5980B00099 "GB" G0001002 1234 "Vlapa1" 1 "Pangolin train" "" "" "" "checked" "" "checked" "checked" "" "" "" "" "" "" "" ""
X5980B00099 "GB" G0001003 1234 "Vlapa1" 1 "Pangolin train" "checked" "" "" "checked" "" "checked" "checked" "" "" "" "" "" "" "" ""
X5980B00099 "GB" G0001004 1234 "Vlapa1" 1 "Pangolin train" "checked" "" "" "checked" "" "checked" "checked" "" "" "" "" "" "" "" ""
The ultimate input will involve telling SAS to go to row 3, but as Reeza notes, you will lose your metadata if you just skip to Datarow=4.
I recommend parsing the file in a preprocessing step, and converting that metadata into input statements. This may be complicated, but it shouldn't be too bad... it is however outside the scope of a StackOverflow answer.
You can look into my presentations Writing Code With Your Data and Documentation Driven Programming (co-author) to see what kind of things you can do as far as writing the input statements. You don't have exactly what either of these expect, but you can input those first few lines using data step input and then transpose that dataset to a more useful format.
Looks like the first three lines have the LABEL, TYPE and NAME for the columns. So read that first and use the information to generate code to read the actual lines of data.
Something like this:
data headers ;
  length row col 8 type $32 value $200 ;
  infile file2 dsd dlm='09'x truncover length=ll column=cc ;
  do type='LABEL','TYPE','NAME';
    row+1;
    do col=1 by 1 until(cc>ll);
      input value @ ;
      if not missing(value) then output;
    end;
    input;
  end;
  stop;
run;
proc sort; by col row; run;
proc transpose data=headers out=meta(drop=_name_) ;
  by col;
  id type ;
  var value;
run;
Which for that second file should get data like:
Obs  col  NAME    LABEL                          TYPE
  1    1  RELAXT
  2    2  RAIN
  3    3  PLUCK
  4    4  RAPPLE
  5    5  VRAT
  6    6  GROSS
  7    7  PANGLE
  8    8  Ant     Arbitrary carpets              boolean
  9    9  App     Household items                boolean
 10   10  Cro     Garage material                boolean
 11   11  BRon    Sundry data (everything else)  boolean
 12   12  Dramas  Vehicle number                 boolean
 13   13  Slacks  Strains                        boolean
 14   14  CRAT    ITM                            boolean
 15   15  Frob    Finals                         boolean
 16   16  Rilo    Dreadspan                      boolean
 17   17  Ph7jj   Printers                       boolean
 18   18  P10rt   Comment 1                      string
 19   19  Irup    comment 2                      boolean
 20   20  GLk2    Grapple                        boolean
 21   21  Dap3    Drops                          boolean
 22   22  Oreta   Triangles                      boolean
Which you might use to generate code like:
data want ;
  infile file2 dsd dlm='09'x truncover firstobs=4 ;
  input
    RELAXT :$20.
    RAIN :$5.
    PLUCK :$20.
    RAPPLE
    VRAT :$20.
    GROSS
    PANGLE :$40.
    Ant :$1.
    App :$1.
    Cro :$1.
    BRon :$1.
    Dramas :$1.
    Slacks :$1.
    CRAT :$1.
    Frob :$1.
    Rilo :$1.
    Ph7jj :$1.
    P10rt :$50.
    Irup :$1.
    GLk2 :$1.
    Dap3 :$1.
    Oreta :$1.
  ;
  label
    Ant ="Arbitrary carpets"
    App ="Household items"
    Cro ="Garage material"
    BRon ="Sundry data (everything else)"
    Dramas ="Vehicle number"
    Slacks ="Strains"
    CRAT ="ITM"
    Frob ="Finals"
    Rilo ="Dreadspan"
    Ph7jj ="Printers"
    P10rt ="Comment 1"
    Irup ="comment 2"
    GLk2 ="Grapple"
    Dap3 ="Drops"
    Oreta ="Triangles"
  ;
run;

Stata: Using if with value labels

I faced an issue using if with value labels.
set obs 5
gen var1 = _n
label define l_var1 1 "cat1" 2 "cat1" 3 "cat2" 4 "cat3" 5 "cat3"
label val var1 l_var1
keep if var1=="cat3":l_var1
(4 observations deleted)
I expected 3 records to be deleted. How can I achieve this?
I am using Stata 16.1.
"cat3":l_var1 does not look up all values in l_var1 that correspond to "cat3". It returns the first value that corresponds to the string "cat3".
So "cat3":l_var1 evaluates to 4, which means keep if var1=="cat3":l_var1 evaluates to keep if var1==4, and therefore only one observation is kept.
The code below shows this behavior. This is not how you seem to want "cat3":l_var1 to behave, but it is how it behaves.
set obs 5
gen var1 = _n
label define l_var1 1 "cat1" 2 "cat1" 3 "cat2" 5 "cat3" 4 "cat3"
label val var1 l_var1
gen var2 = "cat3":l_var1
gen var3 = 1 if var1=="cat3":l_var1
This answers what is going on in your code. The code below is a better way to solve what you are trying to do.
set obs 5
gen var1 = _n
label define l_var1 1 "cat1" 2 "cat1" 3 "cat2" 5 "cat3" 4 "cat3"
label val var1 l_var1
decode var1, generate(var_str)
keep if var_str == "cat3"
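As a quick check of the decode approach, here is a sketch that counts the matches before dropping anything, using the label definition from the question. With two values labelled cat3, two observations should be kept and three deleted.

```stata
clear
set obs 5
gen var1 = _n
label define l_var1 1 "cat1" 2 "cat1" 3 "cat2" 4 "cat3" 5 "cat3"
label values var1 l_var1

* decode copies the value labels into a string variable
decode var1, generate(var_str)
count if var_str == "cat3"
keep if var_str == "cat3"
list
```

Comparing on the decoded string avoids relying on which single value the "cat3":l_var1 lookup happens to return.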

Browse all the rows and columns that contain a zero

Suppose I have 100 variables named ID, var1, var2, ..., var99. I have 1000 rows. I want to browse all the rows and columns that contain a 0.
I wanted to just do this:
browse ID, var* if var* == 0
but it doesn't work. I don't want to hardcode all 99 variables obviously.
I wanted to essentially write an if like this:
gen has0 = 0
forvalues n = 1/99 {
if var`n' does not contain 0 {
drop v
} // pseudocode I know doesn't work
has0 = has0 | var`n' == 0
}
browse if has0 == 1
but obviously that doesn't work.
Do I just need to reshape the data so it has 2 columns ID, var with 100,000 rows total?
My dear colleague @NickCox forces me to reply to this (duplicate) question because he is claiming that downloading, installing and running a new command is better than using built-in ones when you "need to select from 99 variables".
Consider the following toy example:
clear
input var1 var2 var3 var4 var5
1 4 9 5 0
1 8 6 3 7
0 6 5 6 8
4 5 1 8 3
2 1 0 2 1
4 6 7 1 9
end
list
     +----------------------------------+
     | var1   var2   var3   var4   var5 |
     |----------------------------------|
  1. |    1      4      9      5      0 |
  2. |    1      8      6      3      7 |
  3. |    0      6      5      6      8 |
  4. |    4      5      1      8      3 |
  5. |    2      1      0      2      1 |
  6. |    4      6      7      1      9 |
     +----------------------------------+
Actually you don't have to download anything:
preserve
generate obsno = _n
reshape long var, i(obsno)
rename var value
generate var = "var" + string(_j)
list var obsno value if value == 0, noobs
  +----------------------+
  |  var   obsno   value |
  |----------------------|
  | var5       1       0 |
  | var1       3       0 |
  | var3       5       0 |
  +----------------------+
levelsof var if value == 0, local(selectedvars) clean
display "`selectedvars'"
var1 var3 var5
restore
This is the approach I recommended in the linked question for identifying negative values. With levelsof, one can do the same thing as with findname, but using a built-in command.
This solution can also be adapted for browse:
preserve
generate obsno = _n
reshape long var, i(obsno)
rename var value
generate var = "var" + string(_j)
browse var obsno value if value == 0
levelsof var if value == 0, local(selectedvars) clean
display "`selectedvars'"
pause
restore
Although I do not see why one would want to browse the results when one can simply list them.
EDIT:
Here's an example more closely resembling the OP's dataset:
clear
set seed 12345
set obs 1000
generate id = int((_n - 1) / 300) + 1
forvalues i = 1 / 100 {
    generate var`i' = rnormal(0, 150)
}
ds var*
foreach var in `r(varlist)' {
    generate rr = runiform()
    replace `var' = 0 if rr < 0.0001
    drop rr
}
Applying the above solution yields:
display "`selectedvars'"
var13 var19 var35 var36 var42 var86 var88 var90
list id var obsno value if value == 0, noobs sepby(id)
  +----------------------------+
  | id     var   obsno   value |
  |----------------------------|
  |  1   var86      18       0 |
  |  1   var19     167       0 |
  |  1   var13     226       0 |
  |----------------------------|
  |  2   var88     351       0 |
  |  2   var36     361       0 |
  |  2   var35     401       0 |
  |----------------------------|
  |  3   var42     628       0 |
  |  3   var90     643       0 |
  +----------------------------+
Short answer: wildcards for bunches of variables can't be inserted in if qualifiers. (The if command is different from the if qualifier.)
Your question is contradictory on what you want. At one point your pseudocode has you dropping variables! drop has a clear, destructive meaning to Stata programmers: it doesn't mean "ignore".
But let's stick to the emphasis on browse.
findname, any(@ == 0)
finds variables for which any value is 0. Use search findname, sj to find the latest downloadable version.
Note also that
findname, type(numeric)
will return the numeric variables in r(varlist) (and also a local macro if you so specify).
Then several egen functions compete for finding 0s in each observation for a specified varlist: the command findname evidently helps you identify which varlist.
Let's create a small sandbox to show technique:
clear
set obs 5
gen ID = _n
forval j = 1/5 {
    gen var`j' = 1
}
replace var2 = 0 in 2
replace var3 = 0 in 3
list
findname var*, any(@ == 0) local(which)
egen zero = anymatch(`which'), value(0)
list `which' if zero
     +-------------+
     | var2   var3 |
     |-------------|
  2. |    0      1 |
  3. |    1      0 |
     +-------------+
So, the problem is split into two parts, finding the variables with any zeros and finding the observations with any zeros, followed by putting the information together.
Naturally, the use of findname is dispensable as you can just write your own loop to identify the variables of interest:
local wanted
quietly foreach v of var var* {
    count if `v' == 0
    if r(N) > 0 local wanted `wanted' `v'
}
Equally naturally, you can browse as well as list: the difference is just in the command name.
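Putting the pieces together without findname, a minimal sketch on the same kind of sandbox data might look like this. The loop collects the variables containing a zero in the local macro wanted, and egen's anymatch() flags the observations; the variable name zero is just an illustrative choice.

```stata
clear
set obs 5
gen ID = _n
forval j = 1/5 {
    gen var`j' = 1
}
replace var2 = 0 in 2
replace var3 = 0 in 3

* collect the variables that contain at least one zero
local wanted
quietly foreach v of var var* {
    count if `v' == 0
    if r(N) > 0 local wanted `wanted' `v'
}

* flag the observations with a zero in any of those variables
egen zero = anymatch(`wanted'), values(0)
list ID `wanted' if zero
```

Replacing list with browse in the last line gives the same selection in the Data Browser. If no variable contains a zero, wanted is empty and the egen call would need to be skipped.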