Reshaping Data in "chains" format (stata .DTA file) - stata

I've got data in "chain" format: some subjects ("locks") get a treatment, and other subjects ("links") are recruited from each lock. My data are therefore shaped both wide and long. How can I write a Stata program to reshape the .DTA file for running models? My data start like this:
idlock idlink1 idlink2 ...
1 10 11 ...
2 20 21 ...
21 30 31 ...
A link can become a lock later on, but it is still part of the chain of the original lock. So, 21 is a link in the chain that starts with 1.
There are up to 5 links for each new lock (idlink1-idlink5).

More details on what you want to do with the data are needed, but the first thing I would do is create some vars that summarize the number of links per lock (or describe the chains). Then you can treat the data as long panel data with the initial lock as the panelid and the timevar as the number of links or nodes in the chain. I assume you have some more variables in the dataset that you want to model (I've generated them as a random DV and some IVs), then you can model whatever it is you want to model using the suite of -xt- commands in Stata (some examples are provided below):
*******************************! BEGIN EXAMPLE
//this first part will input the dataset into stata//
clear
inp id link0 link1 link2 link3 link4
1 1 2 3 4 5
1000 97 98 99 . .
3 . . . . .
4 . . . . .
5 6 7 8 9 10
6 . . . . .
7 . . . . .
8 11 12 13 14 15
9 . . . . .
10 . . . . .
11 . . . . .
12 . . . . .
13 . . . . .
14 . . . . .
15 . . . . .
99 100 . . . .
100 101 . . . .
101 . . . . .
end
//grab local macro with variables of interest//
unab cou: link*
di "`cou'"
//1. DETERMINE THE INITIAL LOCK//
tempvar pn
g `pn' = .
forval z=0/4{
forval x=1/`=_N' {
replace `pn'= id[_n-`x'] if id==link`z'[_n-`x']
}
}
gen ilock=.
lab var ilock "Initial Lock #"
replace ilock=1 if mi(`pn')
order ilock
l ilock
//2. Links assoc. with each ilock //
**count those with no links established**
count if mi(link0)
//ilocks//
levelsof id if ilock==1, local(ilocks)
foreach n in `ilocks' {
//initial step//
preserve
keep if id==`n'
global s`n' "`=link0' `=link1' `=link2' `=link3' `=link4'"
di "${s`n'}"
global s`n':subinstr global s`n' "." "", all
di "${s`n'}"
restore
}
macro li
//branches off each ilock//
foreach n in `ilocks' {
//branches//
di in red "Branches for macro s`n'"
di as err "${s`n'}"
forval b = 1/10 {
qui token `"${s`n'}"'
while "`1'" != "" {
*di in y "`1'"
preserve
keep if id==`1'
if _N==1 {
global s`n' ${s`n'} `=link0' `=link1' `=link2' `=link3' `=link4'
di "${s`n'}"
global s`n':subinstr global s`n' "." "", all
di in yellow "${s`n'}"
global s`n':list uniq global(s`n')
}
restore
mac shift
}
}
}
//g ilock_number = ilock number if ilocks==branches//
g ilock_number = .
foreach n in `ilocks' {
replace ilock_number = id if id==`n'
di in y "${s`n'}"
global s`n':list uniq global(s`n')
qui token `"${s`n'}"'
while "`1'" != "" {
di in y "`1'"
replace ilock_number = `n' if id==`1'
mac shift
}
}
order ilock_number
sort ilock_number id
count if mi(ilock)
**Descriptives: count # of linknodes**
sort ilock id
bys ilock_number: count if mi(ilock)
sort id ilock
bys ilock_number, rc0: g linknodes = _n
order id link* linknodes ilock_n
l id link* ilock linknodes ilock_n, ta clean div
**descriptives**
ta ilock
ta ilock linknodes
**here are all the chains in your data**
levelsof ilock_number, loc(al)
foreach v in `al' {
macro list s`v'
}
// Running models //
**what kind of model do you want to run?**
**assume using ids to identify panels-->
**create fake dv/iv's for models**
drawnorm iv1-iv5
g dv = abs(int(rbinomial(10, .5)))
xtset ilock_number linknodes
xtreg dv iv*, re
**or model some link/lock info like the #links**
bys ilock_number: g ttl_nodes = _N
xtpoisson ttl_nodes iv* dv , re
*******************************! END EXAMPLE
^note: watch for wrapping issues in the code above!
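If all you need first is a long layout, a minimal sketch is to turn the wide link columns into one row per lock-link pair with reshape long. This assumes the variable names from the question (idlock, idlink1, idlink2) and only the three example rows shown there; linkno is a hypothetical name for the link position.

```stata
clear
input idlock idlink1 idlink2
1 10 11
2 20 21
21 30 31
end
* one row per (lock, link position); linkno is a hypothetical name
reshape long idlink, i(idlock) j(linkno)
* locks with fewer than the maximum number of links leave missing rows behind
drop if missing(idlink)
list, sepby(idlock) noobs
```

The result is an edge list (lock, link), which is usually easier to merge or sort than the wide layout, but it does not by itself identify the initial lock of each chain.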

Related

Collapsing multiple rows into a single row based on a common identifier

Working in Stata, suppose I have a data table like this...
Household Identifier  Person Identifier  Var1  Var2
1                     1                  a     b
1                     1                  c     d
1                     2                  e     f
2                     1                  g     h
2                     1                  i     j
2                     1                  k     l
2                     2                  m     n
2                     2                  o     p
3                     1                  q     r
I want to be able to combine these so there is just one observation per household, i.e. like this
Household Identifier  Person1_Var1_1  Person1_Var2_1  Person1_Var1_2  Person1_Var2_2  Person1_Var3_1  Person1_Var3_2  Person2_Var1_1  Person2_Var2_1  Person2_Var1_2  Person2_Var2_2  Person2_Var3_1  Person2_Var3_2
1                     a               b               c               d               .               .               e               f               .               .               .               .
2                     g               h               i               j               k               l               m               n               o               p               .               .
3                     q               r               .               .               .               .               .               .               .               .               .               .
Is there a straightforward way of doing this?
You can use reshape wide twice. Note that when I create rowid, I add an underscore to it; I also add underscore to the var1 and var2 columns. In the first reshape call, I use string to identify rowid as a string variable
bysort householdidentifier personidentifier: gen rowid = strofreal(_n) + "_"
rename var* =_
reshape wide var1 var2, i(householdidentifier personidentifier) j(rowid) string
reshape wide var*, i(householdidentifier) j(personidentifier)
Output:
     househ~r  var1_1_1  var2_1_1  var1_2_1  var2_2_1  var1_3_1  var2_3_1  var1_1_2  var2_1_2  var1_2_2  var2_2_2  var1_3_2  var2_3_2
1.   1         a         b         c         d                             e         f
2.   2         g         h         i         j         k         l         m         n         o         p
3.   3         q         r
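One thing to watch with a within-group row counter: Stata's sort is not stable by default, so bysort alone does not guarantee that rows keep their original order within a household-person group. A hedged sketch that pins the order first is below. The short names (hh, person, seq, row) and the _r and _p suffixes are assumptions for illustration, not names from the question.

```stata
clear
input hh person str1(var1 var2)
1 1 "a" "b"
1 1 "c" "d"
1 2 "e" "f"
2 1 "g" "h"
2 1 "i" "j"
2 1 "k" "l"
2 2 "m" "n"
2 2 "o" "p"
3 1 "q" "r"
end
* remember the original row order so the counter is reproducible
gen long seq = _n
bysort hh person (seq): gen row = _n
drop seq
* first reshape: one row per household-person, columns var1_r1, var2_r1, ...
rename (var1 var2) (var1_r var2_r)
reshape wide var1_r var2_r, i(hh person) j(row)
* second reshape: one row per household, columns var1_r1_p1, var1_r1_p2, ...
rename var* =_p
unab stubs : var*
reshape wide `stubs', i(hh) j(person)
list, noobs
```

Using a numeric row counter avoids the string option on reshape; the suffixes only keep the row and person indices readable in the final names.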

How to get unique values for all variables in a dataset

I'm using Stata. I have a dataset with approximately 1800 observations and 1050 variables. Most of them are categorical variables with a few categories. It looks something like this:
------------------------------------------------------
| id | fh_1 | fh_1a | fh_2 | fh_2a | fh_3 | fh_3a |...
------------------------------------------------------
|1111| 1 |closed | 2 | 4 | 1 | open |...
------------------------------------------------------
|1112| 2 | open | 1 | 2 | 3 | closed|...
------------------------------------------------------
.
.
.
I need to export to an Excel sheet the list of all variables in this dataset with all unique values for each variable. It should look something like this:
--------------------------
|variable | unique_values|
--------------------------
| fh | 1 2 3 4 5 |
--------------------------
|fh_1a | closed open |
--------------------------
.
.
.
I think I need a loop with the command levelsof but I'm not sure how to build it. Any suggestions?
foreach v of var * {
levelsof `v'
}
would be a start, but I haven't directly addressed how to make that output Excel-friendly.
One possibility is to put all the output in string variables given that the number of observations exceeds the number of variables.
gen varname = ""
gen levels = ""
local i = 1
foreach v of var * {
levelsof `v'
replace varname = "`v'" in `i'
replace levels = `"`r(levels)'"' in `i'
local ++i
}
Here is one way to solve it. You might run into issues if you have string variables where some observations hold values made up of more than one word. Then there is no way to tell whether it was one observation with both words or two observations with one word each.
The values are sorted alphabetically, so you might be able to figure it out anyway, but it could be ambiguous.
sysuse auto,clear
* Get a list of all vars apart from whatever var we do not want to include
ds make, not
local all_vars_but_id `r(varlist)'
* Get the number of vars, represents the number of rows in the dataset to be exported
local num_vars : word count `all_vars_but_id'
* Get the values for each var and store in local with same name as var
foreach var of local all_vars_but_id {
levelsof `var'
local `var' `r(levels)'
}
*Preserve the original data
preserve
* Remove the data and set up the data set to be exported
clear
set obs `num_vars'
gen var = ""
gen values = ""
* Copy the value of the locals created above to one row per variable
local counter 1
foreach var of local all_vars_but_id {
replace var = "`var'" if _n == `counter'
replace values = "``var''" if _n == `counter'
local counter = `counter' + 1
}
* Export to Excel
export excel using "C:\path/to/file/unique_values.xls"
*Restore the original data
restore
Another option using levelsof
input id str6(var1 var2 var3)
1 "open" "2" "3"
2 "closed" "1" "2"
3 "open" "1" "1"
end
reshape long var, i(id)
rename var values
rename _j var
gen unique_values = ""
forvalues i = 1/3 {
levelsof values if var == `i'
replace unique_values = r(levels) if var == `i'
}
replace unique_values = subinstr(unique_values,"`","",.)
replace unique_values = subinstr(unique_values,`"""',"",.)
replace unique_values = subinstr(unique_values,"'","",.)
contract var unique_values
drop _freq
list, noobs
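A further variation, offered as a sketch rather than a tested recipe for the 1050-variable dataset: collect the results with postfile, so the summary is built in a separate dataset and the loop never touches the helper variables. It runs on auto.dta; the column names varname and levels, the str2045 width, and the output file name are assumptions.

```stata
sysuse auto, clear
tempname ph
tempfile results
* varname and levels are hypothetical column names for the export
postfile `ph' str32 varname str2045 levels using `results'
foreach v of varlist * {
    * clean drops the compound quotes around string levels
    quietly levelsof `v', local(lv) clean
    post `ph' ("`v'") (`"`lv'"')
}
postclose `ph'
* this replaces the data in memory; wrap in preserve/restore if needed
use `results', clear
list varname in 1/3
* hypothetical output file name
export excel using "unique_values.xlsx", firstrow(variables) replace
```

The same caveat about multi-word string values applies: with clean, the boundaries between such values are no longer visible.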

How to collapse two columns into one in SAS

What I have
A dataset with 8 rows x 4 cols:
"Condition" "A_1" "B_1"
A 1 .
A 3 .
A 2 .
A 4 .
B . 4
B . 3
B . 5
B . 6
What I want is either:
(1)
"Condition" "A_1" "B_1"
A 1 .
A 3 .
A 2 .
A 4 .
B 4 .
B 3 .
B 5 .
B 6 .
OR, (2):
"Condition" "A_1" "B_1" "AB_1"
A 1 . 1
A 3 . 3
A 2 . 2
A 4 . 4
B . 4 4
B . 3 3
B . 5 5
B . 6 6
It was easy with STATA, R, and Excel (of course), but for the life of me I can't figure out this simple thing in SAS.
I tried,
data want;
if condition = "B" then A_1 = B_1;
set have;
run;
I also tried
data want;
if condition = "A" then AB_1 = A_1;
else AB_1 = B_1;
set have;
run;
The second code almost does the job except that the resulting AB_1 lags by 1 row.
What the hack...
Use coalesce. You also need your set statement before doing any of your logic. SAS reads a row when it encounters the set statement.
data want;
set have;
AB_1 = coalesce(A_1, B_1);
run;
For (2) you can try the cats() function:
data want;
set have;
AB_1 = cats(A_1,B_1);
run;
But to concatenate columns of different types you should also use an explicit PUT() function.

Group by with percentages and raw numbers

I have a dataset that looks like this:
I would like to create a table that groups by area and shows the total amount for the area both as a percentage of total amount and as a raw number, as well as the percent of the total number of records/observations per area and total number of records/observations as a raw number.
The code below generates a table of raw numbers but does not show the percent of total:
tabstat amount, by(county) stat(sum count)
There isn't a canned command for doing what you want. You will have to program the table yourself.
Here's a quick example using auto.dta:
. sysuse auto, clear
(1978 Automobile Data)
. tabstat price, by(foreign) stat(sum count)
Summary for variables: price
by categories of: foreign (Car type)
foreign | sum N
---------+--------------------
Domestic | 315766 52
Foreign | 140463 22
---------+--------------------
Total | 456229 74
------------------------------
You can do the calculations and save the raw numbers in variables as follows:
. generate total_obs = _N
. display total_obs
74
. count if foreign == 0
52
. generate total_domestic_obs = r(N)
. count if foreign == 1
22
. generate total_foreign_obs = r(N)
. egen total_domestic_price = total(price) if foreign == 0
. sort total_domestic_price
. local tdp = total_domestic_price
. display total_domestic_price
315766
. egen total_foreign_price = total(price) if foreign == 1
. sort total_foreign_price
. local tfp = total_foreign_price
. display total_foreign_price
140463
. generate total_price = `tdp' + `tfp'
. display total_price
456229
And for the percentages:
. generate pct_domestic_price = (`tdp' / total_price) * 100
. display pct_domestic_price
69.212173
. generate pct_foreign_price = (`tfp' / total_price) * 100
. display pct_foreign_price
30.787828
EDIT:
Here's a more automated way to do the above without having to specify individual values:
program define foo
syntax varlist(min=1 max=1), by(string)
generate total_obs = _N
display total_obs
quietly levelsof `by', local(nlevels)
foreach x of local nlevels {
count if `by' == `x'
quietly generate total_`by'`x'_obs = r(N)
quietly egen total_`by'`x'_`varlist' = total(`varlist') if `by' == `x'
sort total_`by'`x'_`varlist'
local tvar`x' = total_`by'`x'_`varlist'
local tvarall `tvarall' `tvar`x'' +
display total_`by'`x'_`varlist'
}
quietly generate total_`varlist' = `tvarall' 0
display total_`varlist'
foreach x of local nlevels {
quietly generate pct_`by'`x'_`varlist' = (`tvar`x'' / total_`varlist') * 100
display pct_`by'`x'_`varlist'
}
end
The results are identical:
. foo price, by(foreign)
74
52
315766
22
140463
456229
69.212173
30.787828
You will obviously need to format the results in a table of your liking.
Here's another approach. I stole @Pearly Spencer's example. It could be generalised to a command. The main message I want to convey is that list is useful for tabulations and other reports, usually with just some obligation to calculate what you want to show beforehand.
. sysuse auto, clear
(1978 Automobile Data)
. preserve
. collapse (sum) total=price (count) obs=price, by(foreign)
. egen pc2 = pc(total)
. egen pc1 = pc(obs)
. char pc2[varname] "%"
. char pc1[varname] "%"
. format pc* %2.1f
. list foreign obs pc1 total pc2 , subvarname noobs sum(obs pc1 total pc2)
+-----------------------------------------+
| foreign obs % total % |
|-----------------------------------------|
| Domestic 52 70.3 315766 69.2 |
| Foreign 22 29.7 140463 30.8 |
|-----------------------------------------|
Sum | 74 100.0 456229 100.0 |
+-----------------------------------------+
. restore
EDIT: Here's an essay in egen with a similar flavour, but leaving the original data in place and with the new variables also available for export or graphics.
. sysuse auto, clear
(1978 Automobile Data)
. egen total = sum(price), by(foreign)
. egen obs = count(price), by(total)
. egen tag = tag(foreign)
. egen pc2 = pc(total) if tag
(72 missing values generated)
. egen pc1 = pc(obs) if tag
(72 missing values generated)
. char pc2[varname] "%"
. char pc1[varname] "%"
. format pc* %2.1f
. list foreign obs pc1 total pc2 if tag, subvarname noobs sum(obs pc1 total pc2)
+-----------------------------------------+
| foreign obs % total % |
|-----------------------------------------|
| Domestic 52 70.3 315766 69.2 |
| Foreign 22 29.7 140463 30.8 |
|-----------------------------------------|
Sum | 74 100.0 456229 100.0 |
+-----------------------------------------+

Convert one to many with 2 digits

I am currently handling a data set in Stata generated through ODK, the open data kit.
There is an option to answer a question with multiple answers. For example, my questionnaire asks "Which of these assets do you own?" and the interviewer tags all applicable answers out of 20 options.
This generated for me a string variable with contents such as
"1 2 3 5 11 17 20"
"3 4 8 9 11 14 15 18 20"
"1 3 9 11"
As this is difficult to analyse for several hundred participants, I wanted to generate new variables creating a 1 or 0 for each of the answer options.
For the variable hou_as I tried to generate the variables hou_as_1, hou_as_2 etc. with the following code:
foreach p in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 {
local P : subinstr local p "-" ""
gen byte hou_as_`P' = strpos(hou_as, "`p'") > 0
}
For the single digits this brings the problem that the variable hou_as_1 is also filled with a 1 if any of the 10 11 12 ... 19 is filled even if the option 1 was not chosen. Similarly hou_as_2 is filled when the option 2, 12 or 20 is checked.
How can I avoid this issue?
You want 20 indicator or dummy variables. Note first that it's much easier to use forval to loop 1(1)20, e.g.
forval j = 1/20 {
gen hou_as_`j' = 0
}
initialises 20 such variables as 0.
I think it's easier to loop over the words of your answer variables, words being here just whatever is separated by spaces. There are at most 20 words, and it is a little crude but likely to be fast enough to go
forval j = 1/20 {
forval k = 1/20 {
replace hou_as_`j' = 1 if word(hou_as, `k') == "`j'"
}
}
Let's put that together and try it out on your example:
clear
input str42 hou_as
"1 2 3 5 11 17 20"
"3 4 8 9 11 14 15 18 20"
"1 3 9 11"
end
forval j = 1/20 {
gen hou_as_`j' = 0
forval k = 1/20 {
replace hou_as_`j' = 1 if word(hou_as, `k') == "`j'"
}
}
Just to show that it worked:
. list in 3
+----------------------------------------------------------------------------+
3. | hou_as | hou_as_1 | hou_as_2 | hou_as_3 | hou_as_4 | hou_as_5 | hou_as_6 |
| 1 3 9 11 | 1 | 0 | 1 | 0 | 0 | 0 |
|----------+----------+----------+----------+----------+----------+----------|
| hou_as_7 | hou_as_8 | hou_as_9 | hou_a~10 | hou_a~11 | hou_a~12 | hou_a~13 |
| 0 | 0 | 1 | 0 | 1 | 0 | 0 |
|----------+----------+----------+----------+----------+----------+----------|
| hou_a~14 | hou_a~15 | hou_a~16 | hou_a~17 | hou_a~18 | hou_a~19 | hou_a~20 |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 |
+----------------------------------------------------------------------------+
Incidentally, your line
local P : subinstr local p "-" ""
does nothing useful. The local macro p only ever has contents which are integer digits, so there is no punctuation at all to remove.
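If you would rather keep the strpos() idea from the question, one possible fix is to pad both the answer string and the search term with spaces, so that only whole numbers can match. This is a sketch that assumes the codes are separated by exactly one space, as in the example data:

```stata
clear
input str42 hou_as
"1 2 3 5 11 17 20"
"3 4 8 9 11 14 15 18 20"
"1 3 9 11"
end
forval j = 1/20 {
    * " 1 " cannot match inside " 11 " or " 10 ", so single digits are safe
    gen byte hou_as_`j' = strpos(" " + hou_as + " ", " `j' ") > 0
}
list hou_as hou_as_1 hou_as_2 hou_as_11
```

This needs only one pass per option instead of the double loop, at the cost of relying on the single-space separator.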
See also this explanation and
. search multiple responses, sj
Search of official help files, FAQs, Examples, SJs, and STBs
SJ-5-1 st0082 . . . . . . . . . . . . . . . Tabulation of multiple responses
(help _mrsvmat, mrgraph, mrtab if installed) . . . . . . . . B. Jann
Q1/05 SJ 5(1):92--122
introduces new commands for the computation of one- and
two-way tables of multiple responses
SJ-3-1 pr0008 Speaking Stata: On structure & shape: the case of mult. resp.
. . . . . . . . . . . . . . . . . . . . . . . . N. J. Cox & U. Kohler
Q1/03 SJ 3(1):81--99 (no commands)
discussion of data manipulations for multiple response data