Merge data in Stata by creating new variables

I have two sets of data below in Stata:

name
a
b

name   case #   content
a      1        o
a      2        p
a      3        q
b      1        r
b      2        s

How do I turn them into:

name   1st case   2nd case   3rd case
a      o          p          q
b      r          s

clear
input str1(name) case str1(content)
a 1 o
a 2 p
a 3 q
b 1 r
b 2 s
end
reshape wide content, i(name) j(case)
list, noobs
+---------------------------------------+
| name content1 content2 content3 |
|---------------------------------------|
| a o p q |
| b r s |
+---------------------------------------+
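If the long data had no case variable, one could be derived from the row order before reshaping. The following is only a sketch under that assumption: the helper variable order and the final names case1, case2, ... are illustrative and not part of the original answer, and the rename line relies on the rename-group syntax available from Stata 12 onward.

```stata
clear
input str1 name str1 content
a o
a p
a q
b r
b s
end
gen long order = _n                  // remember the original row order
bysort name (order): gen case = _n   // 1, 2, 3, ... within each name
drop order
reshape wide content, i(name) j(case)
rename content# case#                // content1 -> case1, content2 -> case2, ...
list, noobs
```

Sorting on order inside the bysort keeps the numbering tied to the original row order rather than to whatever order the sort happens to leave ties in.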

Related

Collapsing multiple rows into a single row based on a common identifier

Working in Stata, suppose I have a data table like this...

Household Identifier | Person Identifier | Var1 | Var2
1 | 1 | a | b
1 | 1 | c | d
1 | 2 | e | f
2 | 1 | g | h
2 | 1 | i | j
2 | 1 | k | l
2 | 2 | m | n
2 | 2 | o | p
3 | 1 | q | r

I want to be able to combine these so there is just one observation per household, i.e. like this:

Household Identifier | Person1_Var1_1 | Person1_Var2_1 | Person1_Var1_2 | Person1_Var2_2 | Person1_Var3_1 | Person1_Var3_2 | Person2_Var1_1 | Person2_Var2_1 | Person2_Var1_2 | Person2_Var2_2 | Person2_Var3_1 | Person2_Var3_2
1 | a | b | c | d | . | . | e | f | . | . | . | .
2 | g | h | i | j | k | l | m | n | o | p | . | .
3 | q | r | . | . | . | . | . | . | . | . | . | .

Is there a straightforward way of doing this?
You can use reshape wide twice. Note that when I create rowid, I append an underscore to it; I also append an underscore to the var1 and var2 variable names. In the first reshape call, I use the string option to indicate that rowid is a string variable.
bysort householdidentifier personidentifier: gen rowid = strofreal(_n) + "_"
rename var* =_
reshape wide var1 var2, i(householdidentifier personidentifier) j(rowid) string
reshape wide var*, i(householdidentifier) j(personidentifier)
Output:
       househ~r   var1_1_1   var2_1_1   var1_2_1   var2_2_1   var1_3_1   var2_3_1   var1_1_2   var2_1_2   var1_2_2   var2_2_2   var1_3_2   var2_3_2
  1.          1          a          b          c          d                                 e          f
  2.          2          g          h          i          j          k          l          m          n          o          p
  3.          3          q          r
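The same layout can probably be reached with a single reshape by folding the person and the row number into one string key. This is a sketch, not the answerer's method: it assumes personidentifier is numeric, and the names obs, rowid, and key, as well as the resulting variable names (var1_p1_r1 and so on), are illustrative.

```stata
clear
input householdidentifier personidentifier str1 var1 str1 var2
1 1 a b
1 1 c d
1 2 e f
2 1 g h
2 1 i j
2 1 k l
2 2 m n
2 2 o p
3 1 q r
end
gen long obs = _n    // keeps the original order within each person
bysort householdidentifier personidentifier (obs): gen rowid = _n
* one string key per person/row combination, e.g. "_p1_r2"
gen key = "_p" + string(personidentifier) + "_r" + string(rowid)
* these vary within household, so they must go before reshaping on household alone
drop personidentifier rowid obs
reshape wide var1 var2, i(householdidentifier) j(key) string
list, noobs
```

The two-step version keeps the intermediate person-level dataset available, which may be useful in its own right; the one-step version trades that for fewer commands.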

django left join and right join implement sqlite

I have two tables as shown below :
country
id name
1 A
2 b
3 c
state
id | country_id | name | population
1 | 1 | x | 234354
2 | 1 | y | 2334
3 | 2 | h | 232323
4 | 2 | E | 8238787
Now I want a query that sums the population by country name, like this:
a has xxxx population
b has xxxx population
c has 0 population
In Django, I have written this query:
City.objects.values('country__name').annotate(Sum('population'))
But this does not display 0 for country c :(
The query returns no record for country c because the table has no rows for country c.
City.objects.values('country__name').annotate(Sum('population'))
This query returns only records that exist in the City model.

"Squaring" a dataset

I have a set of plants (A, B, C) that may act both as senders or as receivers but in practice, not all are actually sending or receiving. I need to fill in the missing connections to make the data matrix "square" (or "quadratic") as opposed to rectangularizing it.
Here is my data:
clear
input str1 sender str1 receiver value
A B 100
A C 200
B A 100
end
Stata's fillin command almost does what I want:
fillin sender receiver
drop if sender == receiver
list
+-------------------------------------+
| sender receiver value _fillin |
|-------------------------------------|
1. | A B 100 0 |
2. | A C 200 0 |
3. | B A 100 0 |
4. | B C . 1 |
+-------------------------------------+
Below is the output I expect:
+-----------------------------+
| sender receiver value |
|-----------------------------|
1. | A B 100 |
2. | A C 200 |
3. | B A 100 |
4. | B C . |
5. | C A . |
6. | C B . |
+-----------------------------+
Is there a simple way of doing this?
This is a step more general than @Pearly Spencer's solution.
clear
input str1 sender str1 receiver value
A B 100
A C 200
B A 100
end
egen tag = tag(receiver)
local N = _N
expand 2 if tag
replace sender = receiver if _n > `N'
replace value = . if _n > `N'
fillin sender receiver
drop if sender == receiver
list, sepby(sender)
+-------------------------------------------+
| sender receiver value tag _fillin |
|-------------------------------------------|
1. | A B 100 1 0 |
2. | A C 200 1 0 |
|-------------------------------------------|
3. | B A 100 1 0 |
4. | B C . . 1 |
|-------------------------------------------|
5. | C A . . 1 |
6. | C B . . 1 |
+-------------------------------------------+
You need to provide Stata with the missing piece of information and then apply fillin:
clear
input str1 sender str1 receiver value
A B 100
A C 200
B A 100
end
set obs 4
replace sender = "C" in 4
replace receiver = "A" in 4
fillin sender receiver
drop if sender == receiver
list, separator(0)
+-------------------------------------+
| sender receiver value _fillin |
|-------------------------------------|
1. | A B 100 0 |
2. | A C 200 0 |
3. | B A 100 0 |
4. | B C . 1 |
5. | C A . 0 |
6. | C B . 1 |
+-------------------------------------+
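Another route that avoids hand-entering the missing plant is to build the list of distinct plants from both columns, form all sender-receiver pairs with cross, and merge the observed values back in. This is a sketch rather than either answerer's approach; the variable plant and the tempfile names flows and plants are illustrative, and because it uses temporary files it should be run as one do-file.

```stata
clear
input str1 sender str1 receiver value
A B 100
A C 200
B A 100
end
tempfile flows plants
save `flows'

* stack both name columns into one variable and keep each plant once
stack sender receiver, into(plant) clear
keep plant
duplicates drop

* pair every plant (as receiver) with every plant (as sender)
rename plant sender
save `plants'
rename sender receiver
cross using `plants'
drop if sender == receiver

* bring back the observed values; unmatched pairs keep value missing
merge 1:1 sender receiver using `flows', nogenerate
sort sender receiver
list, noobs sepby(sender)
```

Because the plant list comes from both columns, a plant that only ever appears as a receiver (C here) still gets its own sender rows.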

remove duplicate (non-unique) paired values

I'm working with an edge list in Stata, of the type:
var1 var2
a 1
a 2
a 3
b 1
b 2
1 a
2 b
I want to remove non-unique pairs such as 1a and 2b (which are same as a1 and b2 for me). How can I go about this?
. clear
. input str1 (var1 var2)
var1 var2
1. a 1
2. a 2
3. a 3
4. b 1
5. b 2
6. 1 a
7. 2 b
8. end
. gen first = cond(var1 <= var2, var1, var2)
. gen second = cond(var1 <= var2, var2, var1)
. list
+------------------------------+
| var1 var2 first second |
|------------------------------|
1. | a 1 1 a |
2. | a 2 2 a |
3. | a 3 3 a |
4. | b 1 1 b |
5. | b 2 2 b |
|------------------------------|
6. | 1 a 1 a |
7. | 2 b 2 b |
+------------------------------+
. duplicates list first second
Duplicates in terms of first second
+--------------------------------+
| group: obs: first second |
|--------------------------------|
| 1 1 1 a |
| 1 6 1 a |
| 2 5 2 b |
| 2 7 2 b |
+--------------------------------+
. duplicates drop first second, force
Duplicates in terms of first second
(2 observations deleted)
. list
+------------------------------+
| var1 var2 first second |
|------------------------------|
1. | a 1 1 a |
2. | a 2 2 a |
3. | a 3 3 a |
4. | b 1 1 b |
5. | b 2 2 b |
+------------------------------+
The easy part of the answer is to use duplicates drop. But how to get the data so that 1 a and a 1 are seen to be duplicates? This is all documented here. We can sort the values in each observation so that (in this case) both sort to 1 a. The linked paper says much more, but that's the main idea, and cond() helps.
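If the repeated pairs should be flagged for inspection rather than deleted, the same sorted-pair idea works with a by-group counter. This is a small sketch; the variable names obs and is_repeat are illustrative.

```stata
clear
input str1 (var1 var2)
a 1
a 2
a 3
b 1
b 2
1 a
2 b
end
* put the smaller value of each pair first so that "1 a" and "a 1" look identical
gen first  = cond(var1 <= var2, var1, var2)
gen second = cond(var1 <= var2, var2, var1)
gen long obs = _n
* is_repeat is 1 for every occurrence of a pair after its first appearance
bysort first second (obs): gen byte is_repeat = _n > 1
sort obs
list, noobs
```

Running drop if is_repeat afterwards removes the later occurrences and keeps the first one in the original order.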

Row-wise count/sum of values in Stata

I have a dataset where each person (row) has values 0, 1 or . in a number of variables (columns).
I would like to create two variables. One that includes the count of all the 0 and one that has the count of all the 1 for each person (row).
In my case, there is no pattern in the variable names. For this reason I create a varlist of all the existing variables, excluding the ones that should not be counted.
+--------+--------+------+------+------+------+------+----------+--------+
| ID | region | Qa | Qb | C3 | C4 | Wa | count 0 | count 1|
+--------+--------+------+------+------+------+------+----------+--------+
| 1 | A | 1 | 1 | 1 | 1 | . | 0 | 4 |
| 2 | B | 0 | 0 | 0 | 1 | 1 | 3 | 2 |
| 3 | C | 0 | 0 | . | 0 | 0 | 4 | 0 |
| 4 | D | 1 | 1 | 1 | 1 | 0 | 1 | 4 |
+--------+--------+------+------+------+------+------+----------+--------+
The following works; however, I cannot add an if qualifier:
ds ID region, not // all variables in the dataset apart from ID region
return list
local varlist = r(varlist)
egen count_of_1s = rowtotal(`varlist')
If I replace the last line with the one below, I get an invalid syntax error.
egen count_of_1s = rowtotal(`varlist') if `v' == 1
I switched from counting to summing because I thought it would be a sneaky way around the problem: I could recode the values from 0 and 1 to 1 and 2, sum each of the two values separately into two variables, and then divide accordingly to get the actual count of each value per row.
I found the question "Stata: Using egen, anycount() when values vary for each observation", but Stata freezes because my dataset is quite large (100,000 rows and 3,000 columns).
Any help will be very appreciated :-)
Solution based on the response of William
* number of total valid responses (0s and 1s, excluding . )
ds ID region, not // all variables in the dataset apart from ID region
return list
local varlist = r(varlist)
egen count_of_nonmiss = rownonmiss(`varlist') // this counts all the 0s and 1s (namely, the non missing values)
* total numbers of 1s per row
ds ID region count_of_nonmiss, not // CAUTION: count_of_nonmiss must be excluded here!
return list
local varlist = r(varlist)
egen count_of_1s = rowtotal(`varlist') // rowtotal() is an egen function, so use egen rather than generate
How about
egen count_of_nonmiss = rownonmiss(`varlist')
generate count_of_0s = count_of_nonmiss - count_of_1s
When the value of the macro varlist is substituted into your if clause, the command expands to
egen count_of_1s = rowtotal(Qa Qb C3 C4 Wa) if Qa Qb C3 C4 Wa == 1
Clearly a syntax error: the if qualifier expects a single expression, not a list of variables.
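One way to count both values directly, without passing a variable list to if, is to loop over the variables and add up logical comparisons. This is a sketch using the example data from the question; the counter names are illustrative, and whether it is fast enough for a very wide dataset is untested here.

```stata
clear
input ID str1 region Qa Qb C3 C4 Wa
1 A 1 1 1 1 .
2 B 0 0 0 1 1
3 C 0 0 . 0 0
4 D 1 1 1 1 0
end
ds ID region, not            // every variable except ID and region
local varlist `r(varlist)'   // copy the list without evaluating it as an expression
gen int count_of_0s = 0
gen int count_of_1s = 0
quietly foreach v of local varlist {
    * (`v' == 0) evaluates to 1 when true and 0 otherwise; missing values match neither test
    replace count_of_0s = count_of_0s + (`v' == 0)
    replace count_of_1s = count_of_1s + (`v' == 1)
}
list, noobs
```

The varlist is captured before the counters are created, so the counters are never counted themselves.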
I had the same problem: counting the occurrences of specified values in each observation across a set of variables.
I resolved it as follows. To count the occurrences of 0 in the values across x1-x3:
clear
input id x1 x2 x3
1 1 0 2
2 2 0 2
3 2 0 3
end
egen count2 = anycount(x1-x3), values(0)