Stata function equivalent to group_concat - stata

I would like to group concatenate a categorical variable. Example:
pat x
1 a
1 b
1 b
2 a
2 a
Group concatenation should result in:
pat y
1 a-b-b
2 a-a
In MySQL this would be done using GROUP_CONCAT:
SELECT pat, GROUP_CONCAT(x SEPARATOR '-') y FROM tb GROUP BY pat
It would also be nice if the function could concatenate distinct, ordered values. With the above example, the output should be:
pat y
1 a-b
2 a
With MySQL:
SELECT pat, GROUP_CONCAT(DISTINCT x ORDER BY x SEPARATOR '-') y FROM tb GROUP BY pat

Note that this would reduce the data set to fewer observations.
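* Keep one observation per distinct value of x within pat
bysort pat x: keep if _n == 1
* Build the running concatenation in a new variable y
by pat: gen y = x[1]
by pat: replace y = y[_n-1] + "-" + x if _n > 1
* The last observation in each group holds the full string
by pat: keep if _n == _N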
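The snippet above gives the distinct version (a-b, a). For the first requested output, which keeps duplicates (a-b-b, a-a), a minimal sketch is to skip the de-duplication step and sort within pat instead. It assumes x is a string variable; the `input` block is only a reconstruction of the question's example data.

```stata
* Hypothetical reconstruction of the question's example data
clear
input pat str1 x
1 "a"
1 "b"
1 "b"
2 "a"
2 "a"
end

* Sort by x within pat, but keep every observation (duplicates included)
bysort pat (x): gen y = x[1]

* replace works through the observations in order, so y[_n-1] already
* holds the string built so far for the group
by pat: replace y = y[_n-1] + "-" + x if _n > 1

* The last observation in each group holds the full concatenation
by pat: keep if _n == _N
keep pat y
list, noobs
```

Note that this orders the values by x within each group. To keep the original row order instead, one would first store it (for example `gen long order = _n`) and sort on that variable rather than on x.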

Related

Collapsing multiple rows into a single row based on a common identifier

Working in Stata, suppose I have a data table like this...
Household Identifier  Person Identifier  Var1  Var2
1                     1                  a     b
1                     1                  c     d
1                     2                  e     f
2                     1                  g     h
2                     1                  i     j
2                     1                  k     l
2                     2                  m     n
2                     2                  o     p
3                     1                  q     r
I want to be able to combine these so there is just one observation per household, i.e. like this
Household Identifier  Person1_Var1_1  Person1_Var2_1  Person1_Var1_2  Person1_Var2_2  Person1_Var3_1  Person1_Var3_2  Person2_Var1_1  Person2_Var2_1  Person2_Var1_2  Person2_Var2_2  Person2_Var3_1  Person2_Var3_2
1                     a               b               c               d               .               .               e               f               .               .               .               .
2                     g               h               i               j               k               l               m               n               o               p               .               .
3                     q               r               .               .               .               .               .               .               .               .               .               .
Is there a straightforward way of doing this?
You can use reshape wide twice. Note that when I create rowid, I append an underscore to it; I also append an underscore to the var1 and var2 columns. In the first reshape call, I use the string option to mark rowid as a string variable.
bysort householdidentifier personidentifier: gen rowid = strofreal(_n) + "_"
rename var* =_
* After the rename, the stubs are var1_ and var2_
reshape wide var1_ var2_, i(householdidentifier personidentifier) j(rowid) string
reshape wide var*, i(householdidentifier) j(personidentifier)
Output:
    househ~r  var1_1_1  var2_1_1  var1_2_1  var2_2_1  var1_3_1  var2_3_1  var1_1_2  var2_1_2  var1_2_2  var2_2_2  var1_3_2  var2_3_2
1.  1         a         b         c         d                             e         f
2.  2         g         h         i         j         k         l         m         n         o         p
3.  3         q         r

Sort graph bar by target variables, followed by sort variables

I make a lot of graphs comparing two groups (e.g., male/female) across a number of variables. The standard -graph bar- output groups all bars for men together, and all bars for women together. I am hoping to find a simple way to make bar graphs that group bars first by the target variable (i.e. the variables being graphed), and then by the -over- variable, such as gender.
I have a method for doing this, but it is quite cumbersome. See illustration below.
*Set seed + obs
clear
set seed 442
set obs 100
*Generate two outcomes
gen x1 = uniform()
gen x2 = uniform()
*Generate crossing variable
gen gender = 0 in 1/50
replace gender = 1 in 51/100
label define gender_lab 0 "Male" 1 "Female"
label values gender gender_lab
*Extract means by gender
gen b_male = .
gen b_female = .
sum x1 if gender == 0
replace b_male = r(mean) in 1
sum x1 if gender == 1
replace b_female = r(mean) in 1
sum x2 if gender == 0
replace b_male = r(mean) in 2
sum x2 if gender == 1
replace b_female = r(mean) in 2
*Establish order of graph
gen index_male = _n*3 in 1/2
gen index_female = (_n*3) + 1 in 1/2
*This is what -graph bar- produces naturally
graph bar x1 x2, over(gender)
*This is closer to what I want
twoway bar b_male index_male || bar b_female index_female, xlabel(3.5 "x1" 6.5 "x2", notick labgap(4)) xmlabel(3 "Male" 4 "Female" 6 "Male" 7 "Female") legend(off)
Is there a simple way to use graph bar but still establish the sort order I want? I produce dozens of these graphs per day sometimes, so I want to avoid unnecessary steps as much as possible.
This is a model question: thank you very much!
I'll first copy your code, with some small simplifications which may be of interest anyway.
*Set seed + obs
clear
set seed 442
set obs 100
*Generate two outcomes
gen x1 = runiform()
gen x2 = runiform()
*Generate crossing variable
gen gender = _n > 50
label define gender_lab 0 "Male" 1 "Female"
label values gender gender_lab
*Extract means by gender
sum x1 if gender == 0
gen b_male = r(mean) in 1
sum x1 if gender == 1
gen b_female = r(mean) in 1
sum x2 if gender == 0
replace b_male = r(mean) in 2
sum x2 if gender == 1
replace b_female = r(mean) in 2
*Establish order of graph
gen index_male = _n*3 in 1/2
gen index_female = (_n*3) + 1 in 1/2
*This is what -graph bar- produces naturally
graph bar x1 x2, over(gender) name(G1)
*This is closer to what I want
twoway bar b_male index_male || bar b_female index_female, ///
xlabel(3.5 "x1" 6.5 "x2", notick labgap(4)) ///
xmlabel(3 "Male" 4 "Female" 6 "Male" 7 "Female") legend(off) name(G2)
The good news is that there is a one-line solution once you have installed statplot by Eric A. Booth and myself from SSC. (The email address for Eric in the help file is no longer current.)
ssc inst statplot
statplot x1 x2, over(gender)
statplot x1 x2, over(gender) recast(bar)
statplot x1 x2, over(gender) recast(bar) asyvars yla(, ang(h)) ///
bar(2, bcolor(orange*0.8)) bar(1, bcolor(blue*0.8))
Here is the last graph to show what is done.
statplot defaults to means, which is what you show, so you don't have to calculate means yourself. Other statistics are available.
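If installing a user-written command is not an option, a similar layout can probably be reached with official commands only. The sketch below is one possible route, not part of statplot: it collapses to group means, reshapes so that the outcome becomes a category, and then nests gender inside the outcome with two over() options. The variable name `which` is made up for the illustration.

```stata
* Mock data as in the question
clear
set seed 442
set obs 100
gen x1 = runiform()
gen x2 = runiform()
gen gender = _n > 50
label define gender_lab 0 "Male" 1 "Female"
label values gender gender_lab

preserve

* One row per gender holding the means of x1 and x2
collapse (mean) x1 x2, by(gender)

* One row per gender-by-outcome; which = 1 for x1, 2 for x2
reshape long x, i(gender) j(which)

* The first over() is the inner grouping, the last over() the outer one
graph bar (asis) x, over(gender) over(which, relabel(1 "x1" 2 "x2")) ///
    asyvars ytitle("Mean")

restore
```

The preserve/restore pair leaves the original data untouched, at the cost of a few more lines than the statplot call.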

Fill in missing values of one variable using match with another variable

Imagine the following Stata data structure:
input x y
1 3
1 .
1 .
2 3
2 .
2 .
. 3
end
I want to fill the missing values using the corresponding match of pairs for other observations. However, if there is ambiguity (in the example, 3 corresponding to both 1 and 2), the code should not copy. In my example, the final data structure should look like this:
1 3
1 3
1 3
2 3
2 3
2 3
. 3
Note that both 1 and 2 are filled, as they are unambiguously 3.
My data is only numeric, and the number of unique values of variables x and y is large, so I am looking for a general rule that works in every case.
I am thinking of using the user-written command carryforward, running something like
bysort x: carryforward y if x != . , replace dynamic_condition(x[_n-1] == x[_n]) strict
bysort y: carryforward x if y != . , replace dynamic_condition(y[_n-1] == y[_n]) strict
Yet, this does not work when there are double matches.
UPDATE: the solution proposed by Nick does not work for every example, and I have updated the example to reflect this. The reason is that the egen function tag() marks only one observation for each distinct value. Thus, when a value (3) is related to two values (1, 2), the tag appears for only one of them, and the copying occurs only for that one. For the example above, Nick's code and results are:
egen tagy = tag(y) if !missing(y)
egen tagx = tag(x) if !missing(x)
egen ny = total(tagy), by(x)
egen nx = total(tagx), by(y)
bysort x (y) : replace y = y[1] if ny == 1
bysort y (x) : replace x = x[1] if nx == 1
list, sep(0)
     +-------------------------------+
     | x   y   tagy   tagx   ny   nx |
     |-------------------------------|
  1. | 1   3      0      0    1    0 |
  2. | 1   3      0      0    1    0 |
  3. | 1   3      1      1    1    2 |
  4. | 2   3      0      1    0    2 |
  5. | .   3      0      0    0    2 |
  6. | 2   .      0      0    0    0 |
  7. | 2   .      0      0    0    0 |
     +-------------------------------+
As seen, the code works for filling x=1 and not filling y=3 (line 5). Yet, it does not fill lines 6 and 7 because tagy=1 only appears once (x=1).
This is a bit clunky, but it should work:
bysort x: egen temp = sd(y) if x != .
bysort x (y): replace y = y[1] if temp == 0
drop temp
Since the standard deviation of a constant is zero, temp == 0 when the non-missing y's within a value of x are all the same.
* Sort by x, then y, so that missing y's follow the non-missing ones within x
sort x y
replace y = y[_n-1] if missing(y) & x[_n-1] == x[_n]
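A possible alternative to the standard-deviation check, offered as a sketch rather than a tested general solution: compare the smallest and largest non-missing value within each group. This sidesteps the fact that, as far as I know, egen's sd() is missing when a group has only one non-missing value, which is exactly the situation in the question's example. The helper names ymin, ymax, xmin and xmax are made up here.

```stata
* The question's example data
clear
input x y
1 3
1 .
1 .
2 3
2 .
2 .
. 3
end

* Smallest and largest non-missing y for each value of x
egen ymin = min(y), by(x)
egen ymax = max(y), by(x)

* Fill y only when x is known and its group has exactly one distinct y
replace y = ymin if missing(y) & !missing(x) & ymin == ymax & !missing(ymin)

* Same rule in the other direction: fill x only when y maps to a single x
egen xmin = min(x), by(y)
egen xmax = max(x), by(y)
replace x = xmin if missing(x) & !missing(y) & xmin == xmax & !missing(xmin)

drop ymin ymax xmin xmax
list, sep(0)
```

In the example, y = 3 occurs with both x = 1 and x = 2, so xmin and xmax differ and the missing x in the last observation is left alone, while the missing y's for x = 1 and x = 2 are filled.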

how to impute two variables simultaneously in Stata?

I am trying to impute two variables simultaneously in Stata: say y and x. And then I want to perform a linear regression for them.
The code I used is:
mi set mlong
mi register imputed y x
mi impute regress y a b c, add(10)
mi impute regress x a b c, add(10)
mi estimate: regress y x
I run into an error: "estimation sample varies between m=1 and m=11". Can someone help me out? Thanks!
I prefer doing it using chained equations. The code below should work (note that Part 1 can be skipped as I only used it to generate a suitable mock dataset):
* Part 1
clear all
set seed 0945
set obs 50
gen y0 = _n
gen y = runiform()
sort y
gen x0 = _n
gen x = runiform()
sort x
replace y = . in 1
replace y = . in 5
replace y = . in 10
replace y = . in 15
replace y = . in 20
replace y = . in 25
replace y = . in 30
replace y = . in 35
replace y = . in 40
replace y = . in 45
replace y = . in 50
sort y
replace x = . in 1
replace x = . in 5
replace x = . in 10
replace x = . in 15
replace x = . in 20
replace x = . in 25
replace x = . in 30
replace x = . in 35
replace x = . in 40
replace x = . in 45
replace x = . in 50
gen a = _n
sort x
gen b = _n
gen c = _n
* Part 2
mi set mlong
mi register imputed y x
mi impute chained (regress) y x = a b c, add(10)
mi estimate, dots: regress y x
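A likely reading of the original error, given its reference to m=1 and m=11: the two separate `mi impute regress` calls each add 10 imputations, so y is filled in only in m=1 to 10 and x only in m=11 to 20, and the complete-case sample for `regress y x` then differs across imputations. The sketch below, on made-up data that are not from the question, reproduces the two calls and counts the remaining missing values in m=1 and m=11 so this can be checked.

```stata
* Hypothetical mock data with missing values in both y and x
clear
set seed 12345
set obs 200
gen a = rnormal()
gen b = rnormal()
gen c = rnormal()
gen x = a + b + rnormal()
gen y = x + c + rnormal()
replace x = . if runiform() < 0.1
replace y = . if runiform() < 0.1

mi set mlong
mi register imputed y x

* The question's approach: two univariate imputations, 10 added each time
mi impute regress y a b c, add(10)
mi impute regress x a b c, add(10)

* Count what is still missing in the first imputation of each batch
mi xeq 1 11: count if missing(y)
mi xeq 1 11: count if missing(x)
```

With `mi impute chained`, both variables are imputed within every one of the 10 imputations, which is presumably why the error does not occur there.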

SQL Left Join logic in SAS Merge or Data step

I have below two datasets and need the third dataset as an output.
ONE              TWO
----------       ----------
ID   FLAG        NUMB
1    N           2
2    Y           3
3    Y           9
4    N           2
5    N           3
9    Y           9
10   Y
OUTPUT
-------
ID FLAG NEW
1 N N
2 Y Y
3 Y Y
4 N N
5 N N
9 Y Y
10 Y N
If ONE.ID is found in TWO.NUMB and ONE.FLAG = Y, then the new variable NEW = Y; otherwise NEW = N.
I was able to do this using PROC SQL as below.
proc sql;
create table output as
(
select distinct id, flag, case when numb is null then 'N' else 'Y' end as NEW
from one
left join
two
on id = numb
and flag = 'Y'
);
quit;
Could this be done in DATA step/MERGE?
Since you already have a SQL step, here is an improvement on it. This SQL step does not require a merge:
proc sql noprint;
create table output as
/* NEW is "Y" only when FLAG is "Y" and ID appears in TWO.NUMB */
select distinct *, case
when flag = "Y" and id in (select distinct numb from two) then "Y"
else "N"
end as new
from one
;
quit;