Expand table by merging additional variables as columns - Stata

I have a dataset that looks like this but with several more binary outcome variables: 
* Example generated by -dataex-. To install: ssc install dataex
clear
input long ID byte(Region Type Tier Secure Offshore Highland)
120034 12 1 2 1 0 1
120035 12 1 2 1 0 1
120036 12 1 2 1 0 1
120037 12 1 2 1 0 1
120038 41 1 2 1 0 0
120039 41 2 2 1 1 0
120040 41 2 1 0 1 0
120041 41 2 1 0 1 0
120042 41 2 1 0 1 0
120043 41 2 1 0 0 .
120044 65 2 1 0 0 .
120045 65 3 1 0 0 0
120046 65 3 1 1 0 0
120047 65 3 2 1 1 0
120048 65 3 2 1 0 0
120049 65 3 2 . 1 1
120050 25 3 2 . 1 1
120051 25 5 2 . 1 1
120052 25 5 1 . 0 1
120053 25 5 2 . 0 .
120054 25 5 2 0 0 .
120055 25 5 1 0 . 0
120056 25 5 1 0 . 0
120057 95 7 1 0 1 0
120058 95 7 1 0 1 0
120059 95 7 1 1 1 0
120060 95 7 2 1 0 1
120061 95 7 2 1 0 1
120062 59 7 2 1 0 1
120063 95 8 2 0 . 1
120064 59 8 1 0 . 1
120065 59 8 1 0 . 0
120066 59 8 1 1 . 0
120067 59 8 1 1 1 0
120068 59 8 2 1 1 0
120069 40 9 2 1 1 1
120070 40 9 2 1 0 1
120071 40 9 2 1 0 1
120072 40 9 1 0 0 1
end
I am creating a table with the community-contributed command tabout:
foreach v of var Secure Offshore Highland{
    tabout Region Type Tier `v' using `v'.docx ///
    , replace ///
    style(docx) font(bold) c(freq row) ///
    f(1) layout(cb) landscape nlab(Obs) paper(A4)
    }
It produces row frequencies, row percentages, and totals.
However, I did not need all of this information, so I modified my code as follows:
foreach v of var Secure Offshore Highland{
    tabout Region Type Tier `v' using `v'.docx ///
    , replace ///
    style(docx) font(bold) c(freq row) ///
    f(1) layout(cb) h3(nil) h2(nil) dropc(2 3 4 5 7) landscape nlab(Obs) paper(A4)
    }
This produces what I need, but both versions of my code create a separate table for each outcome variable. I then have to combine the three tables into one manually, keeping the left-most column, the % of "1" column, and the right-most column showing the row total.
Can anyone help me out with the following?
Merging all the tables in one go, keeping the explanatory variable labels in the left-most column and the row totals in the right-most column.
Instead of deleting every column except the % of "1"s, I only want to request the desired column directly. Deleting columns seems so crude and dangerous.
Can I get this same output in Excel through putexcel? I tried following the wonderfully written blog by Chuck Huber, but I cannot figure out the "merging" part.
I got this far thanks to lots and lots of studying, especially Ian Watson's "User Guide for tabout Version 3" and Nicholas Cox's "How to face lists with fortitude".
Cross-posted on Statalist.

You cannot do this readily with tabout -- custom tables require custom programming.
My advice is to create a matrix with whatever values you need and then use the (also) community-contributed command esttab to tabulate and export everything.
That said, what you want requires a fair amount of work, but here is a simplified example based on your data:
* row of missing values used as a visual separator between stacked blocks
matrix N = J(1, 2, .)
local i 0
foreach v in Region Type Tier {
    local i = `i' + 1
    tabulate `v' Secure, matcell(A`i')
    * append a bottom row of column totals
    matrix arowsum = J(1, rowsof(A`i'), 1) * A`i'
    matrix A`i' = A`i' \ arowsum
    * after the first block, local N holds "\ N" so the separator row is stacked in
    if `i' > 1 local N \ N
    matrix m1a = (nullmat(m1a) `N' \ A`i')
}
local i 0
foreach v in Region Type Tier {
    local i = `i' + 1
    tabulate `v' Offshore, matcell(B`i')
    matrix browsum = J(1, rowsof(B`i'), 1) * B`i'
    matrix B`i' = B`i' \ browsum
    if `i' > 1 local N \ N
    matrix m2a = (nullmat(m2a) `N' \ B`i')
}
local i 0
foreach v in Region Type Tier {
    local i = `i' + 1
    tabulate `v' Highland, matcell(C`i')
    matrix crowsum = J(1, rowsof(C`i'), 1) * C`i'
    matrix C`i' = C`i' \ crowsum
    if `i' > 1 local N \ N
    matrix m3a = (nullmat(m3a) `N' \ C`i')
}
* row totals as an extra right-hand column
matrix m1b = m1a * J(colsof(m1a), 1, 1)
matrix m2b = m2a * J(colsof(m2a), 1, 1)
matrix m3b = m3a * J(colsof(m3a), 1, 1)
matrix M1 = m1a, m1b
matrix M2 = m2a, m2b
matrix M3 = m3a, m3b
* separator row, then stack the three tables
matrix K = J(1, 3, .)
matrix M = M1 \ K \ M2 \ K \ M3
You can then use esttab to export the results in Excel or Word:
esttab matrix(M)
---------------------------------------------------
M
c1 c2 c1
---------------------------------------------------
r1 0 4 4
r2 3 0 3
r3 1 3 4
r4 4 2 6
r5 2 4 6
r6 2 3 5
r7 3 3 6
r1 15 19 34
r1 . . .
r1 0 5 5
r2 5 1 6
r3 1 3 4
r4 3 0 3
r5 2 4 6
r6 3 3 6
r7 1 3 4
r1 15 19 34
r1 . . .
r1 13 4 17
r2 2 15 17
r1 15 19 34
r1 . . .
r1 . . .
r1 4 0 4
r2 3 2 5
r3 3 1 4
r4 2 4 6
r5 1 2 3
r6 4 2 6
r7 2 3 5
r1 19 14 33
r1 . . .
r1 5 0 5
r2 2 4 6
r3 3 3 6
r4 3 1 4
r5 3 3 6
r6 0 2 2
r7 3 1 4
r1 19 14 33
r1 . . .
r1 6 7 13
r2 13 7 20
r1 19 14 33
r1 . . .
r1 . . .
r1 0 4 4
r2 2 3 5
r3 0 4 4
r4 5 0 5
r5 4 2 6
r6 4 1 5
r7 3 3 6
r1 18 17 35
r1 . . .
r1 1 4 5
r2 4 0 4
r3 4 2 6
r4 2 2 4
r5 3 3 6
r6 4 2 6
r7 0 4 4
r1 18 17 35
r1 . . .
r1 13 3 16
r2 5 14 19
r1 18 17 35
---------------------------------------------------
You will have to generate the rest of the elements you want separately (including column and row names, etc.), but the idea is the same. You will also have to play with the options in esttab to fine-tune the final output.
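For instance, the export step might look like this (a sketch; the file names are illustrative, and esttab infers the output format from the file extension):
esttab matrix(M) using combined_tables.rtf, replace    // opens in Word
esttab matrix(M) using combined_tables.csv, replace    // opens in Excel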
Note that the above can be written more efficiently, but I have kept everything separate in this answer so that it is easier to follow.
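For reference, the three loops could be collapsed into a single nested loop along these lines (a sketch only; it reuses the separator matrices N and K defined above, and the temporary matrix name T is illustrative):
local j 0
foreach out of varlist Secure Offshore Highland {
    local ++j
    capture matrix drop m`j'a
    local i 0
    local sep ""
    foreach v of varlist Region Type Tier {
        local ++i
        tabulate `v' `out', matcell(T)
        matrix T = T \ (J(1, rowsof(T), 1) * T)               // bottom row of column totals
        if `i' > 1 local sep \ N                              // separator from the second block on
        matrix m`j'a = (nullmat(m`j'a) `sep' \ T)
    }
    matrix M`j' = m`j'a , (m`j'a * J(colsof(m`j'a), 1, 1))    // right-hand column of row totals
}
matrix M = M1 \ K \ M2 \ K \ M3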
EDIT:
If you are working with matrices as above you can also use putexcel easily:
putexcel A1 = matrix(M)
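Note that in current versions of Stata, putexcel first needs a destination workbook to be set; a minimal sketch (the file name is illustrative):
putexcel set results.xlsx, replace    // illustrative file name
putexcel A1 = matrix(M), names        // names also writes the row and column names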

Related

One variable in kg and grams, another indicates which unit; how can I get a new variable in kg?

In Stata, quantity has inputs in both kg and grams, while unit == 1 indicates kg and unit == 2 indicates grams. How can I generate a new variable quantity_kg that converts all gram values into kg?
My existing dataset:
clear
input double(hhid quantity unit unit_price)
1 24 1 .
1 4 1 .
1 350 2 50
1 550 2 90
1 2 1 65
1 3.5 1 85
1 1 1 20
1 4 1 25
1 2 1 .
2 1 1 30
2 2 1 15
2 1 1 20
2 250 2 10
2 2 1 20
2 400 2 10
2 100 2 60
2 1 1 20
My expected dataset:
input double(hhid quantity unit unit_price quantity_kg)
1 24 1 . 24
1 4 1 . 4
1 350 2 50 .35
1 550 2 90 .55
1 2 1 65 2
1 3.5 1 85 3.5
1 1 1 20 1
1 4 1 25 4
1 2 1 . 2
2 1 1 30 1
2 2 1 15 2
2 1 1 20 1
2 250 2 10 .25
2 2 1 20 2
2 400 2 10 .40
2 100 2 60 .10
2 1 1 20 1
The code below does what you want.
This looks like household data, where one typically has to do a lot of unit conversions. Such conversions are also a common source of error, so I have included the best practice of defining conversion rates and unit codes in locals. If you define these in one place, you can reuse the locals everywhere you convert units. It also makes typos easy to spot in the replace lines: you would notice if one row paired kilo_rate with gram_unit. In this simple example it might be overkill, but if you have many units and rates, this is a neat way to avoid errors.
clear
input double(hhid quantity unit unit_price)
1 24 1 .
1 4 1 .
1 350 2 50
1 550 2 90
1 2 1 65
1 3.5 1 85
1 1 1 20
1 4 1 25
1 2 1 .
2 1 1 30
2 2 1 15
2 1 1 20
2 250 2 10
2 2 1 20
2 400 2 10
2 100 2 60
2 1 1 20
end
*Define conversion rates and unit codes
local kilo_rate = 1
local kilo_unit = 1
local gram_rate = 0.001
local gram_unit = 2
*Create the standardized variable
gen quantity_kg = .
replace quantity_kg = quantity * `kilo_rate' if unit == `kilo_unit'
replace quantity_kg = quantity * `gram_rate' if unit == `gram_unit'
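As a quick sanity check (a sketch, using only the variables already created above), you can assert that every observation with a quantity has been converted:
* optional check: no quantity should be left unconverted
assert !missing(quantity_kg) if !missing(quantity)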
Alternatively, the same variable can be created in a single statement with cond() (drop quantity_kg first if you have already created it above):
// unit 1 means kg, unit 2 means g, and 1000 g = 1 kg
generate quantity_kg = cond(unit == 1, quantity, cond(unit == 2, quantity/1000, .))
Your example doesn't have any missing values on unit, but it does no harm to imagine that they might occur.
Providing a comment by way of explanation could be anywhere between redundant and essential for third parties.

Conditional summation in time-to-event data

I have the following data that has been prepared with stset. The resulting variables record cohort entry and exit times along with event status. In addition, a numeric variable, prob, has been calculated based on the risk-set size.
For those subjects that are not cases (where _d == 0), I need to sum all values of the prob variable where _t falls within that subject's follow-up time.
For example, subject 8 enters the cohort at _t0 == 0 and exits at _t == 8. Between these times there are three prob values, 0.9, 0.875 and 0.875, giving the desired answer for subject 8 of 2.65.
* Example generated by -dataex-. To install: ssc install dataex
clear
input long id byte(_t0 _t _d) float prob
1 0 1 0 .
2 0 2 0 .
3 1 3 1 .9
4 0 4 0 .
5 0 5 1 .875
6 0 6 1 .875
7 5 7 0 .
8 0 8 0 .
9 0 9 1 .8333333
10 0 10 1 .8
11 0 11 0 .
12 8 12 1 .6666667
13 0 13 0 .
14 0 14 0 .
15 0 15 0 .
end
The desired output would return all of the data with an additional variable containing the summed values of prob.
Thanks so much in advance.
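One possible approach, as a rough sketch only (it assumes the window of interest for each non-case is _t0 < _t <= _t, and the variable name prob_sum is illustrative):
gen double prob_sum = .
quietly count
forvalues i = 1/`r(N)' {
    if _d[`i'] == 0 {
        * sum prob over observations whose _t lies in this subject's (_t0, _t]
        summarize prob if _t > _t0[`i'] & _t <= _t[`i'], meanonly
        quietly replace prob_sum = r(sum) in `i'
    }
}
list id _t0 _t _d prob prob_sum, noobs
For subject 8 this picks up 0.9 + 0.875 + 0.875 = 2.65, matching the example above.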

Pandas grouped differences with variable lags

I have a pandas data frame with three variables. The first is a grouping variable, the second a within-group "scenario", and the third an outcome. I would like to calculate the within-group difference between the null condition, scenario zero, and the other scenarios within the group. The number of scenarios varies between groups. My data looks like:
ipdb> aDf
FieldId Scenario TN_load
0 0 0 134.922952
1 0 1 111.787326
2 0 2 104.805951
3 1 0 17.743467
4 1 1 13.411849
5 1 2 13.944552
6 1 3 17.499152
7 1 4 17.640090
8 1 5 14.220673
9 1 6 14.912306
10 1 7 17.233862
11 1 8 13.313953
12 1 9 17.967438
13 1 10 14.051882
14 1 11 16.307317
15 1 12 12.506358
16 1 13 16.266233
17 1 14 12.913150
18 1 15 18.149811
19 1 16 12.337736
20 1 17 12.008868
21 1 18 13.434605
22 2 0 454.857959
23 2 1 414.372215
24 2 2 478.371387
25 2 3 385.973388
26 2 4 487.293966
27 2 5 481.280175
28 2 6 403.285123
29 3 0 30.718375
... ... ...
29173 4997 3 53.193992
29174 4997 4 45.800968
I will also have to write functions to get percentage differences etc., but this has me stumped. Any help greatly appreciated.
You can get the difference from scenario 0 within groups using groupby and transform, like this:
df['TN_load_0'] = df['TN_load'].groupby(df['FieldId']).transform(lambda x: x - x.iloc[0])
df
FieldId Scenario TN_load TN_load_0
0 0 0 134.922952 0.000000
1 0 1 111.787326 -23.135626
2 0 2 104.805951 -30.117001
3 1 0 17.743467 0.000000
4 1 1 13.411849 -4.331618
5 1 2 13.944552 -3.798915
6 1 3 17.499152 -0.244315

Pandas grouping values to column

I have a dataframe that looks like this:
a b c d e
0 0 1 2 1 0
1 3 0 0 4 3
2 3 4 0 4 2
3 4 1 0 4 3
4 2 1 3 4 3
5 3 2 0 3 3
6 2 1 1 1 0
7 1 1 0 3 3
8 3 3 3 3 4
9 2 3 4 2 2
I run the following command:
df.groupby('a').sum()
And I get:
b c d e
a
0 1 2 1 0
1 1 0 3 3
2 5 8 7 5
3 9 3 14 12
4 1 0 4 3
And after that I want to access
labels = df['a']
But I get an error that there is no such column.
So does pandas have some syntax to get something like this?
a b c d e
0 0 1 2 1 0
1 1 1 0 3 3
2 2 5 8 7 5
3 3 9 3 14 12
4 4 1 0 4 3
I need to sum all values of columns b, c, d, e to column a with the relevant index
You can just access the index of the grouped result with grouped_df.index and add it back as another column:
grouped_df = df.groupby('a').sum()
grouped_df['a'] = grouped_df.index
grouped_df.sum(axis=1)
Alternatively, groupby has an as_index option that keeps 'a' as a column:
df.groupby('a', as_index=False).sum()
or, after the groupby, you can use reset_index() to put the column 'a' back:
df.groupby('a').sum().reset_index()

Stata: Capture p-value from ranksum test

When I run return list, all after running a ranksum test, the count and z-score are available, but not the p-value. Is there any way of picking it up?
clear
input eventtime prefflag winner stakechange
1 1 1 10
1 2 1 5
2 1 0 50
2 2 0 31
2 1 1 51
2 2 1 20
1 1 0 10
2 2 1 10
2 1 0 5
3 2 0 8
4 2 0 8
5 2 0 8
5 2 1 8
3 1 1 8
4 1 1 8
5 1 1 8
5 1 1 8
end
bysort eventtime winner: tabstat stakechange, stat(mean median n) columns(statistics)
ranksum stakechange if inlist(eventtime, 1, 2) & inlist(winner, 0, .), by(eventtime)
return list, all
Try computing it yourself from r(z) immediately after ranksum, before another r-class command overwrites the stored results:
scalar pval = 2 * normprob(-abs(r(z)))    // two-sided p-value
display pval
The answer is by Nick Cox:
http://www.stata.com/statalist/archive/2004-12/msg00622.html
The Statalist archive is a valuable resource.
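For completeness, the whole sequence with the data above might look like this (a sketch; normal() is the current name for normprob()):
ranksum stakechange if inlist(eventtime, 1, 2) & inlist(winner, 0, .), by(eventtime)
scalar pval = 2 * normal(-abs(r(z)))    // two-sided p-value from the reported z statistic
display %9.4f pval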