Calculate the sum of a variable - Stata

I would like to calculate the sum of variable boasav:
clear
input id boasav
1 2500
1 2900
1 4200
2 5700
2 6100
3 7400
3 7600
3 8300
end
I know that the tabulate command can be used to summarize data, but it only counts:
bys id: tab boasav
-> id = 1
boasav | Freq. Percent Cum.
------------+-----------------------------------
2500 | 1 33.33 33.33
2900 | 1 33.33 66.67
4200 | 1 33.33 100.00
------------+-----------------------------------
Total | 3 100.00
-> id = 2
boasav | Freq. Percent Cum.
------------+-----------------------------------
5700 | 1 50.00 50.00
6100 | 1 50.00 100.00
------------+-----------------------------------
Total | 2 100.00
-> id = 3
boasav | Freq. Percent Cum.
------------+-----------------------------------
7400 | 1 33.33 33.33
7600 | 1 33.33 66.67
8300 | 1 33.33 100.00
------------+-----------------------------------
Total | 3 100.00
However, what I want is the following:
1 9600
2 11800
3 23300
Is there a function that can do this in Stata?

Here are three more methods.
clear
input id boasav
1 2500
1 2900
1 4200
2 5700
2 6100
3 7400
3 7600
3 8300
end
* Method 4: use summarize
forval g = 1/3 {
    su boasav if id == `g', meanonly
    di "`g' " %5.0f r(sum)
}
1 9600
2 11800
3 23300
* Method 5: tabstat
tabstat boasav, by(id) stat(sum)
Summary for variables: boasav
by categories of: id
id | sum
---------+----------
1 | 9600
2 | 11800
3 | 23300
---------+----------
Total | 44700
--------------------
* Method 6: use rangestat (SSC)
rangestat (sum) boasav, int(id 0 0)
tabdisp id, c(boasav_sum)
-------------------------
id | sum of boasav
----------+--------------
1 | 9600
2 | 11800
3 | 23300
-------------------------
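As an aside for readers who also work in Python, the same grouped sum can be cross-checked in pandas; a minimal sketch that simply re-enters the example data:

```python
import pandas as pd

# The example data from the question.
df = pd.DataFrame({
    "id":     [1, 1, 1, 2, 2, 3, 3, 3],
    "boasav": [2500, 2900, 4200, 5700, 6100, 7400, 7600, 8300],
})

# Sum boasav within each id, like `tabstat boasav, by(id) stat(sum)`.
sums = df.groupby("id")["boasav"].sum()
print(sums)
```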


Create a numeric encoding of levels for a string variable in SAS

I would like to create a numeric indicator variable that captures the levels of a complex string id variable in SAS. Note that I cannot use if-then logic, as the id values will change constantly.
I have searched multiple threads and previous questions, and I am unable to find an approach that works.
Below is some sample code that captures the essence of the problem.
data value_id;
length id $54 ;
informat id $54. ;
input id $ ;
cards ;
9jdbh2e7z8-dc4o8mgsft-qi778mt7s0-rz20vk4xwo-ybcx8gaiy8
tsknifwb29-818zgpj2be-vq7558xhqa-1lgqck7219-rq1ojedtmp
ts1j2y9q6u-nghpfhdxsl-vkdwk060gg-s3tred6a7g-h5iqsl8cir
jg1qpqhofy-02d2m62ayb-fg2f6dtvqc-vx4lsnowcj-s4kg37wxah
o3qadvrqtl-kdyw9qpfir-7xeilvuk7e-g73olb67tm-nwwvla6r4g
gc6dny3d5n-qzdkgfkpoc-iv1vnmwu4d-hubjun73y1-mbaggyrmkq
dbvdcbvafv-cmb2zp67tn-gpnfwcvvpt-nl8qpgwn8b-l3biox4318
byxl352kpd-se5godm0gv-jukpzv1u7x-8kffj5th80-mf04nzwvrf
98llcibqon-u86ww0mzgo-ut58htcfv1-lybgj2gsn2-zlvu6n0mym
wfrbcr3i8e-3vodo5wkrr-hp733zwkhy-uxm9uf16zp-5y11re5um5
9jdbh2e7z8-dc4o8mgsft-qi778mt7s0-rz20vk4xwo-ybcx8gaiy8
tsknifwb29-818zgpj2be-vq7558xhqa-1lgqck7219-rq1ojedtmp
ts1j2y9q6u-nghpfhdxsl-vkdwk060gg-s3tred6a7g-h5iqsl8cir
jg1qpqhofy-02d2m62ayb-fg2f6dtvqc-vx4lsnowcj-s4kg37wxah
o3qadvrqtl-kdyw9qpfir-7xeilvuk7e-g73olb67tm-nwwvla6r4g
gc6dny3d5n-qzdkgfkpoc-iv1vnmwu4d-hubjun73y1-mbaggyrmkq
dbvdcbvafv-cmb2zp67tn-gpnfwcvvpt-nl8qpgwn8b-l3biox4318
byxl352kpd-se5godm0gv-jukpzv1u7x-8kffj5th80-mf04nzwvrf
98llcibqon-u86ww0mzgo-ut58htcfv1-lybgj2gsn2-zlvu6n0mym
wfrbcr3i8e-3vodo5wkrr-hp733zwkhy-uxm9uf16zp-5y11re5um5
9jdbh2e7z8-dc4o8mgsft-qi778mt7s0-rz20vk4xwo-ybcx8gaiy8
tsknifwb29-818zgpj2be-vq7558xhqa-1lgqck7219-rq1ojedtmp
ts1j2y9q6u-nghpfhdxsl-vkdwk060gg-s3tred6a7g-h5iqsl8cir
jg1qpqhofy-02d2m62ayb-fg2f6dtvqc-vx4lsnowcj-s4kg37wxah
o3qadvrqtl-kdyw9qpfir-7xeilvuk7e-g73olb67tm-nwwvla6r4g
gc6dny3d5n-qzdkgfkpoc-iv1vnmwu4d-hubjun73y1-mbaggyrmkq
dbvdcbvafv-cmb2zp67tn-gpnfwcvvpt-nl8qpgwn8b-l3biox4318
byxl352kpd-se5godm0gv-jukpzv1u7x-8kffj5th80-mf04nzwvrf
98llcibqon-u86ww0mzgo-ut58htcfv1-lybgj2gsn2-zlvu6n0mym
wfrbcr3i8e-3vodo5wkrr-hp733zwkhy-uxm9uf16zp-5y11re5um5
9jdbh2e7z8-dc4o8mgsft-qi778mt7s0-rz20vk4xwo-ybcx8gaiy8
tsknifwb29-818zgpj2be-vq7558xhqa-1lgqck7219-rq1ojedtmp
ts1j2y9q6u-nghpfhdxsl-vkdwk060gg-s3tred6a7g-h5iqsl8cir
jg1qpqhofy-02d2m62ayb-fg2f6dtvqc-vx4lsnowcj-s4kg37wxah
o3qadvrqtl-kdyw9qpfir-7xeilvuk7e-g73olb67tm-nwwvla6r4g
gc6dny3d5n-qzdkgfkpoc-iv1vnmwu4d-hubjun73y1-mbaggyrmkq
dbvdcbvafv-cmb2zp67tn-gpnfwcvvpt-nl8qpgwn8b-l3biox4318
byxl352kpd-se5godm0gv-jukpzv1u7x-8kffj5th80-mf04nzwvrf
98llcibqon-u86ww0mzgo-ut58htcfv1-lybgj2gsn2-zlvu6n0mym
wfrbcr3i8e-3vodo5wkrr-hp733zwkhy-uxm9uf16zp-5y11re5um5
;
proc sort data = value_id
out = value_id_s ascii ;
by id ;
run;
Ideally, I would like to have a result where I can create a numeric indicator variable id_n, like the following:
id id_n
98llcibqon-u86ww0mzgo-ut58htcfv1-lybgj2gsn2-zlvu6n0mym 1
98llcibqon-u86ww0mzgo-ut58htcfv1-lybgj2gsn2-zlvu6n0mym 1
98llcibqon-u86ww0mzgo-ut58htcfv1-lybgj2gsn2-zlvu6n0mym 1
98llcibqon-u86ww0mzgo-ut58htcfv1-lybgj2gsn2-zlvu6n0mym 1
9jdbh2e7z8-dc4o8mgsft-qi778mt7s0-rz20vk4xwo-ybcx8gaiy8 2
9jdbh2e7z8-dc4o8mgsft-qi778mt7s0-rz20vk4xwo-ybcx8gaiy8 2
9jdbh2e7z8-dc4o8mgsft-qi778mt7s0-rz20vk4xwo-ybcx8gaiy8 2
9jdbh2e7z8-dc4o8mgsft-qi778mt7s0-rz20vk4xwo-ybcx8gaiy8 2
byxl352kpd-se5godm0gv-jukpzv1u7x-8kffj5th80-mf04nzwvrf 3
byxl352kpd-se5godm0gv-jukpzv1u7x-8kffj5th80-mf04nzwvrf 3
byxl352kpd-se5godm0gv-jukpzv1u7x-8kffj5th80-mf04nzwvrf 3
byxl352kpd-se5godm0gv-jukpzv1u7x-8kffj5th80-mf04nzwvrf 3
... ...
Any advice / guidance on the approach would be very much appreciated.
Aside
If I were using Stata, I would use encode and move on with my day...
. encode id, gen(id_n)
. tab1 id id_n , nolabel
-> tabulation of id
id | Freq. Percent Cum.
----------------------------------------+-----------------------------------
98llcibqon-u86ww0mzgo-ut58htcfv1-lybg.. | 4 10.00 10.00
9jdbh2e7z8-dc4o8mgsft-qi778mt7s0-rz20.. | 4 10.00 20.00
byxl352kpd-se5godm0gv-jukpzv1u7x-8kff.. | 4 10.00 30.00
dbvdcbvafv-cmb2zp67tn-gpnfwcvvpt-nl8q.. | 4 10.00 40.00
gc6dny3d5n-qzdkgfkpoc-iv1vnmwu4d-hubj.. | 4 10.00 50.00
jg1qpqhofy-02d2m62ayb-fg2f6dtvqc-vx4l.. | 4 10.00 60.00
o3qadvrqtl-kdyw9qpfir-7xeilvuk7e-g73o.. | 4 10.00 70.00
ts1j2y9q6u-nghpfhdxsl-vkdwk060gg-s3tr.. | 4 10.00 80.00
tsknifwb29-818zgpj2be-vq7558xhqa-1lgq.. | 4 10.00 90.00
wfrbcr3i8e-3vodo5wkrr-hp733zwkhy-uxm9.. | 4 10.00 100.00
----------------------------------------+-----------------------------------
Total | 40 100.00
-> tabulation of id_n
id_n | Freq. Percent Cum.
------------+-----------------------------------
1 | 4 10.00 10.00
2 | 4 10.00 20.00
3 | 4 10.00 30.00
4 | 4 10.00 40.00
5 | 4 10.00 50.00
6 | 4 10.00 60.00
7 | 4 10.00 70.00
8 | 4 10.00 80.00
9 | 4 10.00 90.00
10 | 4 10.00 100.00
------------+-----------------------------------
Total | 40 100.00
Try this
proc sort data = value_id out = value_id_s ascii;
by id;
run;
data want;
set value_id_s;
by id;
if first.id then id_n + 1;
run;
Perhaps use this
data want;
retain id_n 0;
set value_id_s;
by id;
if first.id then id_n+1;
run;
id id_n
98llcibqon-u86ww0mzgo-ut58htcfv1-lybgj2gsn2-zlvu6n0mym 1
98llcibqon-u86ww0mzgo-ut58htcfv1-lybgj2gsn2-zlvu6n0mym 1
98llcibqon-u86ww0mzgo-ut58htcfv1-lybgj2gsn2-zlvu6n0mym 1
98llcibqon-u86ww0mzgo-ut58htcfv1-lybgj2gsn2-zlvu6n0mym 1
9jdbh2e7z8-dc4o8mgsft-qi778mt7s0-rz20vk4xwo-ybcx8gaiy8 2
9jdbh2e7z8-dc4o8mgsft-qi778mt7s0-rz20vk4xwo-ybcx8gaiy8 2
9jdbh2e7z8-dc4o8mgsft-qi778mt7s0-rz20vk4xwo-ybcx8gaiy8 2
9jdbh2e7z8-dc4o8mgsft-qi778mt7s0-rz20vk4xwo-ybcx8gaiy8 2
byxl352kpd-se5godm0gv-jukpzv1u7x-8kffj5th80-mf04nzwvrf 3
byxl352kpd-se5godm0gv-jukpzv1u7x-8kffj5th80-mf04nzwvrf 3
byxl352kpd-se5godm0gv-jukpzv1u7x-8kffj5th80-mf04nzwvrf 3
byxl352kpd-se5godm0gv-jukpzv1u7x-8kffj5th80-mf04nzwvrf 3
...
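As a further aside, the same first.id-style numbering can be sketched in pandas (the id strings below are illustrative, not the sample data above):

```python
import pandas as pd

# Three distinct string ids, repeated, in arbitrary order (illustrative values).
df = pd.DataFrame({"id": ["b-x", "a-y", "c-z", "a-y", "b-x", "c-z"]})

# Sort by id, then number distinct values 1, 2, 3, ...,
# mirroring the first.id increment in the SAS DATA step (or Stata's encode).
df = df.sort_values("id").reset_index(drop=True)
df["id_n"] = (df["id"] != df["id"].shift()).cumsum()
print(df)
```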

Compute % difference between two matrices

I really need help here.
My user illustrated what they want in Excel, and I have tried to reproduce it in Power BI using the matrix visual. Here are examples of my data.
They are matrices of summarized data at different points in time.
As of 7 Sep 2022
Category | GROUP A      | Sub total | GROUP B      | Sub total | Total
         | CAPEX  OPEX  |           | CAPEX  OPEX  |           |
---------+--------------+-----------+--------------+-----------+------
1. TP    |   0      1   |     1     |   2      3   |     5     |   6
2. MA    |   0      0   |     0     |   0      0   |     0     |   0
Total    |   0      1   |     1     |   2      3   |     5     |   6
As of 13 Sep 2022
Category | GROUP A      | Sub total | GROUP B      | Sub total | Total
         | CAPEX  OPEX  |           | CAPEX  OPEX  |           |
---------+--------------+-----------+--------------+-----------+------
1. TP    |   0      4   |     4     |   5      7   |    12     |  16
2. MA    |   0      0   |     0     |   0      0   |     0     |   0
Total    |   0      4   |     4     |   5      7   |    12     |  16
They want to see the change between those two matrices as a percentage (increase or decrease).
Something like this
Category | GROUP A       | Sub total | GROUP B        | Sub total | Total
         | CAPEX  OPEX   |           | CAPEX   OPEX   |           |
---------+---------------+-----------+----------------+-----------+------
1. TP    |  0%    +300%  |   +300%   | +150%   +133%  |   +140%   | +166%
2. MA    |  0%      0%   |     0%    |   0%      0%   |     0%    |   0%
Total    |  0%    +300%  |   +300%   | +150%   +133%  |   +140%   | +166%
Is there a way I could do this with DAX or anything else in Power BI?
Please help! Thank you!
Edited: Added sample data
Here is the data sample I am working on.
PROJECT_NAME   BUDGET_TYPE   Category   GROUP   Created
AAAAA          OPEX          1. TP      A       12/9/2022 22:07
BBBBBB         CAPEX         1. TP      A       11/9/2022 20:57
CCCCC          CAPEX         1. TP      B       4/9/2022 14:07
DDDDD          OPEX          1. TP      B       5/9/2022 13:57
EEEEEE         CAPEX         2. MA      A       9/9/2022 12:22
FFFFFF         OPEX          1. TP      B       7/9/2022 9:57
GGGGG          OPEX          2. MA      B       16/8/2022 22:08
HHHHH          CAPEX         1. TP      A       16/8/2022 22:07
Note:
I have the dimension tables for BUDGET_TYPE, Category, GROUP
I have a calendar table whose formula is CALENDAR = CALENDAR(DATE(2022,1,1), DATE(2022,12,31))
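While the question asks for Power BI/DAX, the underlying arithmetic is just (later - earlier) / earlier per cell; a minimal pandas sketch of that calculation on the category totals from the example (a 0 baseline is shown as 0%, matching the desired output):

```python
import numpy as np
import pandas as pd

# Category totals as of the two snapshot dates (from the example matrices).
sep07 = pd.Series({"1. TP": 6, "2. MA": 0}, dtype=float)
sep13 = pd.Series({"1. TP": 16, "2. MA": 0}, dtype=float)

# Percent change; a 0 baseline is mapped to 0% rather than a division error.
pct = ((sep13 - sep07) / sep07.replace(0, np.nan) * 100).fillna(0)
print(pct)
```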

Save duplicates by id

I have two variables in Stata, id and price:
id price
1 4321
1 7634
1 7974
1 7634
1 3244
2 5943
2 3294
2 5645
2 3564
2 4321
2 4567
2 4567
2 4567
2 4567
3 5652
3 9586
3 5844
3 8684
3 2456
4 7634
Usually I can use the duplicates command to get the duplicate observations of a variable.
However, how can I create a new variable that will save the duplicates
of price for each id?
There is no reason that I can see for duplicates to work with by:. The general recipe with your example is duplicates whatever price id, which examines duplicates jointly for two variables. Consider:
clear
input id price
1 4321
1 7634
1 7974
1 7634
1 3244
2 5943
2 3294
2 5645
2 3564
2 4321
2 4567
2 4567
2 4567
2 4567
3 5652
3 9586
3 5844
3 8684
3 2456
4 7634
end
. duplicates example id price
Duplicates in terms of id price
+------------------------------------+
| group: # e.g. obs id price |
|------------------------------------|
| 1 2 2 1 7634 |
| 2 4 11 2 4567 |
+------------------------------------+
. duplicates tag id price, gen(tag)
Duplicates in terms of id price
. list id price if tag , sepby(id)
+------------+
| id price |
|------------|
2. | 1 7634 |
4. | 1 7634 |
|------------|
11. | 2 4567 |
12. | 2 4567 |
13. | 2 4567 |
14. | 2 4567 |
+------------+
Beyond that, I am not clear exactly what output or data result you wish to see.
EDIT: In response to a comment, here are two more direct approaches. duplicates is based on the idea that duplicates are mostly unwanted; you seem to have the opposite point of view, in which case duplicates is oblique to your wants.
* approach 1
bysort price id : gen wanted = _n == 1 & _N > 1
list if wanted
+---------------------+
| id price wanted |
|---------------------|
7. | 2 4567 1 |
15. | 1 7634 1 |
+---------------------+
* approach 2
drop wanted
bysort price id : keep if _n == 1 & _N > 1
list
+------------+
| id price |
|------------|
1. | 2 4567 |
2. | 1 7634 |
+------------+
Naturally, if you want to duplicate data yet further (why?), then after approach 1
gen duplicated_price = price if wanted
gives you one copy of each of the duplicated values in a new variable. This is a slightly simpler equivalent of @Pearly Spencer's approach.
bysort price id : gen duplicated_price = price if _n == 1 & _N > 1
does it in one line.
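For cross-checking in another tool, approach 2 (one row per duplicated id-price pair) can be sketched in pandas; the frame below is a small illustrative subset, not the full sample:

```python
import pandas as pd

df = pd.DataFrame({
    "id":    [1, 1, 1, 2, 2, 2, 2],
    "price": [7634, 7634, 3244, 4567, 4567, 4567, 5943],
})

# Keep one row per (id, price) pair that occurs more than once,
# mirroring `bysort price id : keep if _n == 1 & _N > 1`.
counts = df.groupby(["id", "price"])["price"].transform("size")
dups = df[counts > 1].drop_duplicates(["id", "price"])
print(dups)
```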

Ranking variables according to their percent contribution to total

Consider the following example data:
   psu | sumsc  sumst  sumobc  sumother  sumcaste
-------+-----------------------------------------
 10018 |     3      2       0         4         9
 10061 |     0      0       2         5         7
 10116 |     1      1       2         4         8
 10121 |     3      0       1         2         6
 20002 |     4      1       0         1         6
-------------------------------------------------
I want to rank the variables sumsc, sumst, sumobc, and sumother according to their percent contribution to sumcaste (this is the total of all variables) within psu.
Could anyone help me do this in Stata?
First we enter the data:
clear all
set more off
input psu sumsc sumst sumobc sumother sumcaste
10018 3 2 0 4 9
10061 0 0 2 5 7
10116 1 1 2 4 8
10121 3 0 1 2 6
20002 4 1 0 1 6
end
Second, we prepare the reshape:
local j = 1
foreach var of varlist sumsc sumst sumobc sumother {
    gen temprl`j' = `var' / sumcaste
    ren `var' addi`j'
    local ++j
}
reshape long temprl addi, i(psu) j(ord)
lab def ord 1 "sumsc" 2 "sumst" 3 "sumobc" 4 "sumother"
lab val ord ord
Third, we order before presenting:
gsort psu -temprl
by psu: gen nro=_n
drop temprl
order psu nro ord
Fourth, presenting the data:
br psu nro ord addi
EDIT:
This is a combination of Aron's solution with mine (@PearlySpencer):
clear
input psu sumsc sumst sumobc sumother sumcaste
10018 3 2 0 4 9
10061 0 0 2 5 7
10116 1 1 2 4 8
10121 3 0 1 2 6
20002 4 1 0 1 6
end
local i = 0
foreach var of varlist sumsc sumst sumobc sumother {
    local ++i
    generate pct`i' = 100 * `var' / sumcaste
    rename `var' temp`i'
    local rvars "`rvars' r`i'"
}
rowranks pct*, generate("`rvars'") field lowrank
reshape long pct temp r, i(psu) j(name)
label define name 1 "sumsc" 2 "sumst" 3 "sumobc" 4 "sumother"
label values name name
keep psu name pct r
bysort psu (r): replace r = sum(r != r[_n-1])
Which gives you the desired output:
list, sepby(psu) noobs
+---------------------------------+
| psu name pct r |
|---------------------------------|
| 10018 sumother 44.44444 1 |
| 10018 sumsc 33.33333 2 |
| 10018 sumst 22.22222 3 |
| 10018 sumobc 0 4 |
|---------------------------------|
| 10061 sumother 71.42857 1 |
| 10061 sumobc 28.57143 2 |
| 10061 sumsc 0 3 |
| 10061 sumst 0 3 |
|---------------------------------|
| 10116 sumother 50 1 |
| 10116 sumobc 25 2 |
| 10116 sumst 12.5 3 |
| 10116 sumsc 12.5 3 |
|---------------------------------|
| 10121 sumsc 50 1 |
| 10121 sumother 33.33333 2 |
| 10121 sumobc 16.66667 3 |
| 10121 sumst 0 4 |
|---------------------------------|
| 20002 sumsc 66.66666 1 |
| 20002 sumst 16.66667 2 |
| 20002 sumother 16.66667 2 |
| 20002 sumobc 0 3 |
+---------------------------------+
This approach will be useful if you need the variables for further analysis as opposed to just displaying the results.
First you need to calculate percentages:
clear
input psu sumsc sumst sumobc sumother sumcaste
10018 3 2 0 4 9
10061 0 0 2 5 7
10116 1 1 2 4 8
10121 3 0 1 2 6
20002 4 1 0 1 6
end
foreach var of varlist sumsc sumst sumobc sumother {
    generate pct_`var' = 100 * `var' / sumcaste
}
egen pcttotal = rowtotal(pct_*)
list pct_* pcttotal, abbreviate(15) noobs
+--------------------------------------------------------------+
| pct_sumsc pct_sumst pct_sumobc pct_sumother pcttotal |
|--------------------------------------------------------------|
| 33.33333 22.22222 0 44.44444 100 |
| 0 0 28.57143 71.42857 100 |
| 12.5 12.5 25 50 100 |
| 50 0 16.66667 33.33333 100 |
| 66.66666 16.66667 0 16.66667 99.99999 |
+--------------------------------------------------------------+
Then you need to get the ranks and do some gymnastics:
rowranks pct_*, generate(r_sumsc r_sumst r_sumobc r_sumother) field lowrank
mkmat r_*, matrix(A)
matrix A = A'
svmat A, names(row)
local matnames : rownames A
quietly generate name = " "
forvalues i = 1 / `: word count `matnames'' {
    quietly replace name = substr(`"`: word `i' of `matnames''"', 3, .) in `i'
}
ds row*
foreach var in `r(varlist)' {
    sort `var' name
    generate `var'b = sum(`var' != `var'[_n-1])
    drop `var'
    rename `var'b `var'
    list name `var' if name != " ", noobs
    display ""
}
The above will give you what you want:
+-----------------+
| name row1 |
|-----------------|
| sumother 1 |
| sumsc 2 |
| sumst 3 |
| sumobc 4 |
+-----------------+
+-----------------+
| name row2 |
|-----------------|
| sumother 1 |
| sumobc 2 |
| sumsc 3 |
| sumst 3 |
+-----------------+
+-----------------+
| name row3 |
|-----------------|
| sumother 1 |
| sumobc 2 |
| sumsc 3 |
| sumst 3 |
+-----------------+
+-----------------+
| name row4 |
|-----------------|
| sumsc 1 |
| sumother 2 |
| sumobc 3 |
| sumst 4 |
+-----------------+
+-----------------+
| name row5 |
|-----------------|
| sumsc 1 |
| sumother 2 |
| sumst 2 |
| sumobc 3 |
+-----------------+
Note that you will first need to install the community-contributed command rowranks before you execute the above code:
net install pr0046.pkg
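As a cross-check, the rank-with-ties behaviour shown above (ties share a rank and the next rank is not skipped) is dense ranking; a pandas sketch that re-enters the psu 20002 percentages from the output:

```python
import pandas as pd

# Percent contributions for psu 20002, taken from the example output.
pct = pd.DataFrame({
    "psu":  [20002] * 4,
    "name": ["sumsc", "sumst", "sumother", "sumobc"],
    "pct":  [66.67, 16.67, 16.67, 0.0],
})

# Dense rank within psu: largest contribution first, ties share a rank,
# next rank not skipped (like the sum(r != r[_n-1]) trick in Stata).
pct["r"] = (pct.groupby("psu")["pct"]
               .rank(method="dense", ascending=False)
               .astype(int))
print(pct.sort_values(["psu", "r"]))
```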

Create table for asclogit and nlogit

Suppose I have the following table:
id | car | sex | income
-------------------------------
1 | European | Male | 45000
2 | Japanese | Female | 48000
3 | American | Male | 53000
I would like to create the one below:
| id | car | choice | sex | income
--------------------------------------------
1.| 1 | European | 1 | Male | 45000
2.| 1 | American | 0 | Male | 45000
3.| 1 | Japanese | 0 | Male | 45000
| ----------------------------------------
4.| 2 | European | 0 | Female | 48000
5.| 2 | American | 0 | Female | 48000
6.| 2 | Japanese | 1 | Female | 48000
| ----------------------------------------
7.| 3 | European | 0 | Male | 53000
8.| 3 | American | 1 | Male | 53000
9.| 3 | Japanese | 0 | Male | 53000
I would like to fit an asclogit model and, according to Example 1 in Stata's manual, this table format seems necessary. However, I have not found a way to create it easily.
You can use the cross command to generate all the possible combinations:
clear
input byte id str10 car str8 sex long income
1 "European" "Male" 45000
2 "Japanese" "Female" 48000
3 "American" "Male" 53000
end
generate choice = 0
save old, replace
keep id
save new, replace
use old
rename id =_0
cross using new
replace choice = 1 if id_0 == id
replace sex = cond(id == 2, "Female", "Male")
replace income = cond(id == 1, 45000, cond(id == 2, 48000, 53000))
Note that the use of the cond() function here is equivalent to:
replace sex = "Male" if id == 1
replace sex = "Female" if id == 2
replace sex = "Male" if id == 3
replace income = 45000 if id == 1
replace income = 48000 if id == 2
replace income = 53000 if id == 3
The above code snippet produces the desired output:
drop id_0
order id car choice sex income
sort id car
list, sepby(id)
+------------------------------------------+
| id car choice sex income |
|------------------------------------------|
1. | 1 American 0 Male 45000 |
2. | 1 European 1 Male 45000 |
3. | 1 Japanese 0 Male 45000 |
|------------------------------------------|
4. | 2 American 0 Female 48000 |
5. | 2 European 0 Female 48000 |
6. | 2 Japanese 1 Female 48000 |
|------------------------------------------|
7. | 3 American 1 Male 53000 |
8. | 3 European 0 Male 53000 |
9. | 3 Japanese 0 Male 53000 |
+------------------------------------------+
For more information, type help cross and help cond() from Stata's command prompt.
Please see dataex in Stata for how to produce data examples useful in web forums. (If necessary, install first using ssc install dataex.)
This could be an exercise in using fillin followed by filling in the missings.
* Example generated by -dataex-. To install: ssc install dataex
clear
input byte id str10 car str8 sex long income
1 "European" "Male" 45000
2 "Japanese" "Female" 48000
3 "American" "Male" 53000
end
fillin id car
foreach v in sex income {
    bysort id (_fillin) : replace `v' = `v'[1]
}
list , sepby(id)
+-------------------------------------------+
| id car sex income _fillin |
|-------------------------------------------|
1. | 1 European Male 45000 0 |
2. | 1 American Male 45000 1 |
3. | 1 Japanese Male 45000 1 |
|-------------------------------------------|
4. | 2 Japanese Female 48000 0 |
5. | 2 European Female 48000 1 |
6. | 2 American Female 48000 1 |
|-------------------------------------------|
7. | 3 American Male 53000 0 |
8. | 3 European Male 53000 1 |
9. | 3 Japanese Male 53000 1 |
+-------------------------------------------+
A provisional solution using pandas in Python is the following:
1) Open the dataset with:
df = pd.read_stata("mybase.dta")
2) Use the code from the accepted answer of this question.
3) Save the dataset:
df.to_stata("newbase.dta")
If one wants to use dummy variables, reshape is also an option.
clear
input byte id str10 car str8 sex long income
1 "European" "Male" 45000
2 "Japanese" "Female" 48000
3 "American" "Male" 53000
end
tabulate car, gen(choice)
reshape long choice, i(id)
label define car 2 "European" 3 "Japanese" 1 "American"
drop car
rename _j car
label values car car
list, sepby(id)
+------------------------------------------+
| id car sex income choice |
|------------------------------------------|
1. | 1 American Male 45000 0 |
2. | 1 European Male 45000 1 |
3. | 1 Japanese Male 45000 0 |
|------------------------------------------|
4. | 2 American Female 48000 0 |
5. | 2 European Female 48000 0 |
6. | 2 Japanese Female 48000 1 |
|------------------------------------------|
7. | 3 American Male 53000 1 |
8. | 3 European Male 53000 0 |
9. | 3 Japanese Male 53000 0 |
+------------------------------------------+
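The provisional pandas solution above points to an external answer; for completeness, a self-contained sketch of the same expansion using a cross merge (requires pandas >= 1.2):

```python
import pandas as pd

df = pd.DataFrame({
    "id":     [1, 2, 3],
    "car":    ["European", "Japanese", "American"],
    "sex":    ["Male", "Female", "Male"],
    "income": [45000, 48000, 53000],
})

# Cross every id with every car, mirroring Stata's `fillin id car`.
alts = df[["car"]].rename(columns={"car": "alt"})
long = df.merge(alts, how="cross")

# choice = 1 only for the alternative the person actually chose.
long["choice"] = (long["car"] == long["alt"]).astype(int)
long = (long.drop(columns="car")
            .rename(columns={"alt": "car"})
            .sort_values(["id", "car"])
            [["id", "car", "choice", "sex", "income"]])
print(long)
```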