Collapsing multiple rows into a single row based on a common identifier - stata

Working in Stata, suppose I have a data table like this...
Household Identifier   Person Identifier   Var1   Var2
1                      1                   a      b
1                      1                   c      d
1                      2                   e      f
2                      1                   g      h
2                      1                   i      j
2                      1                   k      l
2                      2                   m      n
2                      2                   o      p
3                      1                   q      r
I want to be able to combine these so there is just one observation per household, i.e. like this
Household Identifier  Person1_Var1_1  Person1_Var2_1  Person1_Var1_2  Person1_Var2_2  Person1_Var3_1  Person1_Var3_2  Person2_Var1_1  Person2_Var2_1  Person2_Var1_2  Person2_Var2_2  Person2_Var3_1  Person2_Var3_2
1                     a               b               c               d               .               .               e               f               .               .               .               .
2                     g               h               i               j               k               l               m               n               o               p               .               .
3                     q               r               .               .               .               .               .               .               .               .               .               .
Is there a straightforward way of doing this?

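You can use reshape wide twice. First I create rowid, the row number within each household/person pair, with an underscore appended; I also append an underscore to the var1 and var2 names so the pieces of the final variable names stay separated. In the first reshape call, the string option tells reshape that rowid is a string variable.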
bysort householdidentifier personidentifier: gen rowid = strofreal(_n) + "_"
rename var* =_
reshape wide var1 var2, i(householdidentifier personidentifier) j(rowid) string
reshape wide var*, i(householdidentifier) j(personidentifier)
Output:
     househ~r  var1_1_1  var2_1_1  var1_2_1  var2_2_1  var1_3_1  var2_3_1  var1_1_2  var2_1_2  var1_2_2  var2_2_2  var1_3_2  var2_3_2
  1.        1         a         b         c         d                             e         f
  2.        2         g         h         i         j         k         l         m         n         o         p
  3.        3         q         r
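For readers who prefer to see the logic outside Stata, the double reshape above can be sketched in Python. This is only an illustration of the idea, not Stata code: the function name to_wide and the dict-of-dicts layout are assumptions made for this sketch, and the keys simply mirror the var1_<row>_<person> naming seen in the output.

```python
from collections import defaultdict

# Long-format rows: (household, person, var1, var2), as in the question.
rows = [
    (1, 1, "a", "b"), (1, 1, "c", "d"), (1, 2, "e", "f"),
    (2, 1, "g", "h"), (2, 1, "i", "j"), (2, 1, "k", "l"),
    (2, 2, "m", "n"), (2, 2, "o", "p"),
    (3, 1, "q", "r"),
]

def to_wide(rows):
    """Return {household: {"var1_<row>_<person>": value, ...}} (hypothetical helper)."""
    counter = defaultdict(int)  # running row number within (household, person)
    wide = defaultdict(dict)
    for hh, person, var1, var2 in rows:
        counter[(hh, person)] += 1          # plays the role of rowid (_n within group)
        rowid = counter[(hh, person)]
        wide[hh][f"var1_{rowid}_{person}"] = var1
        wide[hh][f"var2_{rowid}_{person}"] = var2
    return dict(wide)

wide = to_wide(rows)
print(wide[1])
```

Keys that are absent for a household correspond to the empty cells in the Stata output.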

Related

Merge data in Stata by creating new variables

I have two sets of data below in Stata
name
a
b

name  case #  content
a     1       o
a     2       p
a     3       q
b     1       r
b     2       s
How do I turn them into:
name  1st case  2nd case  3rd case
a     o         p         q
b     r         s
clear
input str1(name) case str1(content)
a 1 o
a 2 p
a 3 q
b 1 r
b 2 s
end
reshape wide content, i(name) j(case)
list, noobs
  +---------------------------------------+
  | name   content1   content2   content3 |
  |---------------------------------------|
  |    a          o          p          q |
  |    b          r          s            |
  +---------------------------------------+

sort dataframe columns in R

Is there a way to sort data frame columns in R? I tried the code below, but the result comes back as a character vector instead of a data frame.
> asd <- data.frame(a = c("fsd","sdfsd"))
> asd <- with(asd, asd[order(a) , ])
> asd
[1] "fsd" "sdfsd"
Can I get the result back as a data frame?
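Index the rows with order(). Note that when a data frame has only one column, [ simplifies the result to a vector unless you pass drop = FALSE, e.g. asd[order(asd$a), , drop = FALSE]. With two or more columns the result stays a data frame. Try this: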
a <- data.frame(x=LETTERS[1:5],y=c(5:1))
a[order(a$x),]
a[order(a$y),]
> a[order(a$x),]
  x y
1 A 5
2 B 4
3 C 3
4 D 2
5 E 1
> a[order(a$y),]
  x y
5 E 1
4 D 2
3 C 3
2 B 4
1 A 5

Combine overlapping categorical variables

I am trying to "combine" two categorical variables in Stata (say var1 and var2) into a new (also categorical) variable (say res).
The example below illustrates what I am trying to achieve:
var1 var2 res
1 1 A
1 2 A
2 1 A
3 3 B
4 2 A
5 4 D
What this example does is to combine all categories of var1 and var2 that "overlap".
Here, the pair var1 == 1 and var2 == 1 initially forms a group (res == A). All other pairs containing var1 == 1 or var2 == 1 should belong to the same group (hence res == A in rows 2 and 3). Because in row 2 we have var2 == 2, any pair containing var2 == 2 should belong to the same group. That's why res == A in row 5.
Another way to look at this problem is using the following matrix:
    |  1   2   3   4
-----------------------
  1 |  1   1
  2 |  1
  3 |          1
  4 |      1
  5 |              1
Because the element [1,1] is not empty (or zero), all elements in row 1 and column 1 must belong to the same group. Because [1,2] is not empty, the same is true for row 1, column 2. And so on and so forth. It does not matter which row/column you decide to start from.
egen group alone doesn't cut it.
Any ideas?
Sounds like you want to further group var1 if the values of var2 are the same. If that's the case, then you can use a program I wrote called group_id that's available from SSC. To install it, type in Stata's Command window:
ssc install group_id
Here's an example of how you would use it:
* Example generated by -dataex-. To install: ssc install dataex
clear
input float(var1 var2) str1 res
1 1 "A"
1 2 "A"
2 1 "A"
3 3 "B"
4 2 "A"
5 4 "D"
end
gen long wanted = var1
group_id wanted, matchby(var2)
list, sep(0)
and the results:
. list, sep(0)
     +----------------------------+
     | var1   var2   res   wanted |
     |----------------------------|
  1. |    1      1     A        1 |
  2. |    1      2     A        1 |
  3. |    2      1     A        1 |
  4. |    3      3     B        3 |
  5. |    4      2     A        1 |
  6. |    5      4     D        5 |
     +----------------------------+
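The grouping rule described in the question (rows belong together if they share a var1 value or a var2 value, transitively) amounts to finding connected components. As a rough sketch of that idea, and not a description of how group_id is implemented, here is a small union-find in Python. The function name group_pairs and the numeric labels are assumptions made for this illustration.

```python
def group_pairs(pairs):
    """Label each (var1, var2) row with a component number using union-find."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Tag values with their column so var1 == 1 and var2 == 1 are distinct nodes.
    for v1, v2 in pairs:
        union(("var1", v1), ("var2", v2))

    labels = {}
    result = []
    for v1, _ in pairs:
        root = find(("var1", v1))
        labels.setdefault(root, len(labels) + 1)  # number groups in order of appearance
        result.append(labels[root])
    return result

pairs = [(1, 1), (1, 2), (2, 1), (3, 3), (4, 2), (5, 4)]
print(group_pairs(pairs))  # [1, 1, 1, 2, 1, 3]
```

The labels differ from the wanted column above (1, 2, 3 rather than 1, 3, 5), but the partition of the rows is the same.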

how to impute two variables simultaneously in Stata?

I am trying to impute two variables simultaneously in Stata: say y and x. And then I want to perform a linear regression for them.
The code I used is:
mi set mlong
mi register imputed y x
mi impute regress y a b c, add(10)
mi impute regress x a b c, add(10)
mi estimate: regress y x
I run into an error: "estimation sample varies between m=1 and m=11". Can someone help me out? Thanks!
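The error most likely arises because each mi impute call with add(10) appends ten new imputations: y is filled in only in m=1 to m=10 and x only in m=11 to m=20, so the set of complete observations differs across imputations. Imputing both variables in a single call avoids this. I prefer doing it using chained equations. The code below should work (note that Part 1 can be skipped, as I only used it to generate a suitable mock dataset):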
* Part 1
clear all
set seed 0945
set obs 50
gen y0 = _n
gen y = runiform()
sort y
gen x0 = _n
gen x = runiform()
sort x
replace y = . in 1
replace y = . in 5
replace y = . in 10
replace y = . in 15
replace y = . in 20
replace y = . in 25
replace y = . in 30
replace y = . in 35
replace y = . in 40
replace y = . in 45
replace y = . in 50
sort y
replace x = . in 1
replace x = . in 5
replace x = . in 10
replace x = . in 15
replace x = . in 20
replace x = . in 25
replace x = . in 30
replace x = . in 35
replace x = . in 40
replace x = . in 45
replace x = . in 50
gen a = _n
sort x
gen b = _n
gen c = _n
* Part 2
mi set mlong
mi register imputed y x
mi impute chained (regress) y x = a b c, add(10)
mi estimate, dots: regress y x

creating a new column based on observations in other columns

I am having trouble creating a new variable using conditions. I've tried data steps, but to no avail.
My data set looks like this right now:
A B C D E
1 . 1 1 .
. 1 . . .
1 . 1 . 1
I need it to look like this:
A B C D E R
. . . . . 1
. 1 . . . .
. . . . . 1
The idea I've used is: if the sum of A -- D is greater than 1, set R equal to 1 (otherwise leave it missing), and then clear the 1s in A, B, C, D and E for those observations. It's not working for me, perhaps because of the missing values.
Code I've used so far:
data campZ;
set campY;
select;
when (sum(Macroscopic -- Symbolic > 1)) Random = 1;
otherwise; end;
run;
I've tried PROC SQL as well, though I have mainly been focusing on the data step. Any help would be great.
Thank you!
Will
It looks like you want to both SET R and clear the other variables. You need to add the OF keyword when using a variable list as an argument to a function.
data campZ;
    set campY;
    if sum(of Macroscopic -- Symbolic) > 1 then do;
        Random = 1;
        call missing(of Macroscopic -- Symbolic);
    end;
run;
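To make the row rule in the data step above concrete, here is a minimal Python sketch of the same logic, using None to stand in for a SAS missing value. The function name flag_row is hypothetical. One small difference: SAS SUM() returns missing when every argument is missing, whereas this sketch yields 0; the comparison with 1 comes out the same either way.

```python
def flag_row(row):
    """Return row + [R]; None stands in for a SAS missing value (.)."""
    # Like SUM(of ...), skip missing values instead of propagating them.
    total = sum(v for v in row if v is not None)
    if total > 1:
        # Mirrors: Random = 1; call missing(of ...);
        return [None] * len(row) + [1]
    return list(row) + [None]

data = [
    [1, None, 1, 1, None],
    [None, 1, None, None, None],
    [1, None, 1, None, 1],
]
result = [flag_row(r) for r in data]
for r in result:
    print(r)
```

Adding the values with a plain + instead would propagate missing values in SAS, which is why the SUM function with the OF keyword is used in the data step.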
/* SUM() skips missing values; A+B+C+D would be missing if any term is missing */
SELECT A, B, C, D, E,
       CASE WHEN SUM(A, B, C, D) > 1 THEN 1 END AS R
FROM Table;
(Apologies if I've made any syntax slips, my SAS SQL is a bit rusty.)
You can execute a query to do this . . . although I think a data step is quite reasonable. Here is one way to do the above in proc sql.
proc sql;
    select (case when cnt <= 1 then a end) as a,
           (case when cnt <= 1 then b end) as b,
           (case when cnt <= 1 then c end) as c,
           (case when cnt <= 1 then d end) as d,
           (case when cnt <= 1 then e end) as e,
           (case when cnt > 1 then 1 end) as r
    from (select z.*,
                 ((case when a is null then 0 else 1 end) +
                  (case when b is null then 0 else 1 end) +
                  (case when c is null then 0 else 1 end) +
                  (case when d is null then 0 else 1 end) +
                  (case when e is null then 0 else 1 end)
                 ) as cnt
          from campz z
         ) z;
quit;
This just returns the values. If you want them in a new data set, then use create table as.
data a;
input A B C D E;
cards;
1 . 1 1 .
. 1 . . .
1 . 1 . 1
;
proc sql noprint;
create table a1 as
select *, case
when sum(a,b,c,d,e)>1 then 1
when sum(a,b,c,d,e)<=1 then .
end as R from a;
update a1 set A=., B=., C=., D=., E=.
where R=1;
quit;
Output:
Obs A B C D E R
1 . . . . . 1
2 . 1 . . . .
3 . . . . . 1