Skewness in Stata

I have tried many different combinations of sktest, and sadly nothing works.
I was almost certain that sktest would work with the by prefix, but it doesn't.
The issue is this: I have a binary variable gender (male 0, female 1), and I want to measure the skewness of the variable returns separately for males and females. Can you please advise?
I was hoping for a result similar to what we get when we run, e.g., by gender: summarize returns

Different questions are bundled together here.
Testing
If you want to run sktest for different groups, you can just repeat the command
sysuse auto, clear
sktest price if foreign == 1
sktest price if foreign == 0
or write your own wrapper program to do the same. sktest in essence shows P-values but no summary measures.
Or do something like this:
preserve
statsby , by(foreign) : sktest price
list
restore
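The wrapper idea mentioned above could look something like the following. This is only a minimal sketch using the auto data; looping over levelsof is one way to do it, not the only one.

```stata
sysuse auto, clear
* collect the distinct values of the grouping variable
levelsof foreign, local(levels)
* run sktest once per group
foreach g of local levels {
    display as text _n "-> foreign = `g'"
    sktest price if foreign == `g'
}
```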
Measuring
If you want to see (moment-based) skewness measures, you can just repeat summarize
bysort foreign: summarize price, detail
A wrapper is already available on SSC that is more selective.
moments price, by(foreign)
----------------------------------------------------------------------
    Group |          n        mean          SD    skewness    kurtosis
----------+-----------------------------------------------------------
 Domestic |         52    6072.423    3097.104       1.778       5.090
  Foreign |         22    6384.682    2621.915       1.215       3.555
----------------------------------------------------------------------
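If the skewness is wanted as a variable rather than as printed output, the egen function skew() is one option. A minimal sketch with the auto data (the variable name skew_price is just an illustration):

```stata
sysuse auto, clear
* moment-based skewness of price within each category of foreign
bysort foreign: egen skew_price = skew(price)
* show one line per group
tabdisp foreign, c(skew_price)
```

As far as I know this rests on the same calculation as summarize, detail, so the values should agree with the skewness column reported by moments.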
Warnings
Stata uses one estimator for moment-based skewness. There are others.
There are many ways to measure skewness. Those others mentioned in Section 7 of this paper are not a complete list; perhaps the most important omission is L-skewness (see lmoments from SSC).
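To make the first warning concrete: as far as I know, summarize, detail reports the plain moment ratio m3 / m2^(3/2), with both moments computed using divisor n. One commonly quoted alternative rescales that ratio by sqrt(n(n-1))/(n-2). The sketch below computes the alternative from the stored results; the local macro names are illustrative.

```stata
sysuse auto, clear
quietly summarize price, detail
local n  = r(N)
local g1 = r(skewness)
* adjusted estimator (an alternative; not what summarize reports)
local G1 = `g1' * sqrt(`n' * (`n' - 1)) / (`n' - 2)
display "g1 = " %6.4f `g1' "   G1 = " %6.4f `G1'
```

The two differ little unless the sample is small, but it is worth knowing which one a given program reports.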

Related

Identify groups with few observations in panel-data models (Stata)

How can I identify groups with few observations in panel-data models?
I estimated several random-effects models using xtlogit. On average I have 26 observations per group, but some groups have only 1 observation. I want to identify those groups and exclude them from the models. Any suggestions on how to do that?
My panel data is set using: xtset countrycode year
Let's suppose your magic number for a big enough panel is 7 and that you fit a first model.
bysort countrycode : egen n_used = total(e(sample))
then gives you a count of how many observations were available and can be used, after which your criterion for a later model is if n_used >= 7
You could just go
bysort countrycode : gen n_available = _N
regardless of a model fit.
The differences are two-fold:
The last statement counts every observation in each panel, even those with missing values in the model variables, which a model fit would drop.
If you also used if and/or in to restrict the model fit to particular subsets of observations, then e(sample) knows about that, but the last statement does not.
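Putting the pieces together, the whole sequence might run as follows. This is a sketch only: it assumes the union dataset can be downloaded with webuse (so it needs an internet connection), and the model is there just to have something that sets e(sample).

```stata
webuse union, clear
xtset idcode year
* first model: e(sample) marks the observations actually used
xtlogit union age grade, re
bysort idcode : egen n_used = total(e(sample))
bysort idcode : gen n_available = _N
* refit using only panels with at least 7 usable observations
xtlogit union age grade if n_used >= 7, re
```

With your own data, countrycode would take the place of idcode.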

Count by groups and collapse

I have a dataset in Stata and want to count by group (loc_ID) and year. I used the following two lines of code:
egen count_obsv = tag(loc_ID year)
This adds a variable (count_obsv) to my dataset that is 1 for the first observation of each new combination of loc_ID and year, and 0 for every other observation with the same combination.
Then I use:
collapse (sum) count_obsv, by(loc_ID year)
According to various Stata forum posts, this should result in, e.g.:
loc_ID year count_obsv
1 2000 342
1 2001 23
2 2008 23
...
But my output is:
loc_ID year count_obsv
1 2000 1
1 2001 1
2 2008 1
...
What am I summarizing wrong?
When you call up the tag() function of the egen command, you assign the value 1 to just one of any number of observations with the same distinct values for the specified variables, and 0 to all the others. Then when you ask for the sum of those values in the same groups of observations, you get the group sums of one 1 and any number of 0s, and each sum is thus necessarily 1.
Your question is probably abstracted from some other calculations that worked as you expected, but if all you wanted was a dataset with frequencies, then
contract loc_ID year
would do that for you. If you wanted a dataset with summaries of other variables too, you would need something more like
collapse (count) count=foo (mean) mean=foo (sd) sd=foo, by(loc_ID year)
I doubt that any Statalist posts state otherwise. (I wrote tag() in 1999, and I am not aware of this as a misunderstanding.) There is a related but so to speak distinct problem where tag() comes in useful, which is counting distinct values (often called unique values).
sysuse auto, clear
egen tag = tag(foreign rep78)
egen distinct = total(tag), by(foreign)
tabdisp foreign, c(distinct)
would be a way to get at the number of distinct values of rep78 within categories of foreign.
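For completeness, the collapse route to group frequencies sums a constant 1 (or counts a nonmissing variable) rather than a tag. A small sketch, with foreign and rep78 from the auto data standing in for loc_ID and year:

```stata
sysuse auto, clear
* a constant 1 for every observation, unlike the result of tag()
gen byte one = 1
* group sums of the constant are the group frequencies
collapse (sum) count_obsv = one, by(foreign rep78)
list, sepby(foreign)
```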

Storing results of binomial confidence interval in Stata using by prefix

I am trying to calculate the 95% binomial Wilson confidence interval for the proportion of people completing treatment by year (dataset is line-listed for each person).
I want to store the results into a matrix so that I can use the putexcel command to export the results to an existing Excel spreadsheet without changing the formatting of the sheet. I have created a binary variable dscomplete_binary which is 0 for a person if treatment was not completed, and 1 if treatment was completed.
I have tried the following:
bysort year: ci dscomplete_binary, binomial wilson level(95)
This gives output of each year with the 95% confidence intervals. Previously I used statsby to collapse the dataset to store the results in variables but this clears the dataset from the memory and so I have to constantly re-open it.
Is there a way to run the command and store the results in a tabular format so that the data is stored in a similar way to this:
year mean LowerCI UpperCI
r1 2005 .7031588 .69229454 .71379805
r2 2006 .75532377 .74504232 .7653212
r3 2007 .78125924 .77125096 .79094833
r4 2008 .80014324 .79059798 .80935836
r5 2009 .81860977 .80955398 .82732689
r6 2010 .82641232 .81723672 .83522016
r7 2011 .81854123 .80955547 .82719356
r8 2012 .83497983 .82621944 .8433823
r9 2013 .85411799 .84527379 .86253893
r10 2014 .84461939 .83499599 .85377985
I have tried the following commands, which give different estimates to the binomial Wilson option:
svyset id2
bysort year: eststo: ci dscomplete_binary, binomial wilson level(95)
I think the postfile family of commands will help you here. This won't save your data into a matrix, but will save the results of the ci command into a new data set, which you name and whose structure you set. After the analysis is complete, you can load the data saved by postfile and export to Excel in the manner of your choosing.
For postfile, you analyze the data in a loop instead of using by or bysort.
Assuming the years in your data run 2005-2014, here is sample code:
/*make sure no postfile is open, in case a previous run did not close the file*/
cap postclose ci_results
/*create the postfile that will store results*/
postfile ci_results year mean lowerCI upperCI using ci_results.dta, replace
/*loop through years (2005-2014, as assumed above)*/
forval y = 2005/2014 {
    ci dscomplete_binary if year==`y', binomial wilson level(95)
    /*store saved results from ci in the postfile. The post statement names the postfile handle and lists the results in the same order as in the postfile command.*/
    post ci_results (`y') (r(mean)) (r(lb)) (r(ub))
}
/*close the postfile once you've looped through all the cases of interest*/
postclose ci_results
use ci_results.dta, clear
Once you load the ci_results.dta data into memory, you can apply any Excel exporting command you like.
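If a matrix really is wanted, the same loop idea can build one row by row with nullmat(). The following is a sketch on the auto data, with rep78 playing the role of year and the matrix name results chosen arbitrarily. It keeps the ci syntax used in the question; newer Stata releases document this command as ci proportions, so adjust the call and the stored result names if your version needs it.

```stata
sysuse auto, clear
capture matrix drop results
levelsof rep78, local(levels)
foreach g of local levels {
    quietly ci foreign if rep78 == `g', binomial wilson level(95)
    * append one row: group, proportion, lower limit, upper limit
    matrix results = nullmat(results) \ (`g', r(mean), r(lb), r(ub))
}
matrix colnames results = rep78 mean lowerCI upperCI
matrix list results
```

The data in memory are left untouched, and the matrix can then be handed to whatever export command you prefer.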
This is a development of the suggestion already made to use statsby. The objections to it are quite puzzling, as it is easy to get back to the original dataset. There is some machine time in re-loading a dataset, but how much personal time has been spent in pursuit of an alternative?
Absent a dataset which we can use, I've provided a reproducible example.
If you wish to do this repeatedly, you'll write a more elaborate program to do it, which is what this forum is all about.
I leave how to export results to Excel as a matter for those so inclined: no details of what is wanted are provided in any case.
. sysuse auto, clear
(1978 Automobile Data)
. preserve
. statsby mean=r(mean) ub=r(ub) lb=r(lb), by(rep78) : ci foreign, binomial wilson level(95)
(running ci on estimation sample)
command: ci foreign, binomial wilson
mean: r(mean)
ub: r(ub)
lb: r(lb)
by: rep78
Statsby groups
----+--- 1 ---+--- 2 ---+--- 3 ---+--- 4 ---+--- 5
.....
. list
     +----------------------------------------+
     | rep78       mean         ub         lb |
     |----------------------------------------|
  1. |     1          0   .6576198          0 |
  2. |     2          0   .3244076          0 |
  3. |     3         .1   .2562108   .0345999 |
  4. |     4         .5   .7096898   .2903102 |
  5. |     5   .8181818   .9486323   .5230194 |
     +----------------------------------------+
. restore
. describe
The describe results will show that we are back where we started.

EU-SILC database about education and experience

I am using the EU-SILC database for 2008 for Greece. First, I would like to use PE040 to create three dummies: primeduc for pre-primary and primary education; seceduc for lower secondary, (upper) secondary, and post-secondary non-tertiary education; and tereduc for the first and second stages of tertiary education.
Second, I would like to create a variable for working experience based on the idea exper=age-educ-6, where educ would be something like the number of years spent in education.
Any ideas on which Stata commands I should use?
What I've tried so far
About stata syntax:
tabulate PE040, gen(educ)
gen primeduc=educ1+educ2
gen seceduc=educ3+educ4+educ5
gen tereduc=educ6
Having defined lnwage as log(PY010N/(PL060+PL070)) and age as 2008-PB140, I tried to run a regression, but it uses only 191 observations.
For your first question, I think you want a 0-1 indicator, equal to 1 if either of the indicated educational categories was recorded.
gen primeduc = educ1 | educ2
gen seceduc = educ3 | educ4 | educ5
The "|" stands for logical "or". For example, primeduc will be 1 if educ1 is 1 or educ2 is 1.
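One detail to watch: Stata treats missing as nonzero, hence as true, in logical expressions, and tabulate with gen() leaves its dummies missing wherever the original variable is missing. So an expression like educ1 | educ2 would return 1 for observations with missing PE040 unless they are excluded. The sketch below shows one way to guard against that; the toy values of PE040 are invented purely for illustration and are not the EU-SILC data.

```stata
clear
* toy data: six education codes plus one missing value (hypothetical)
input PE040
0
1
2
3
4
5
.
end
tabulate PE040, gen(educ)
* the if qualifier leaves the indicators missing where PE040 is missing
gen primeduc = (educ1 | educ2) if !missing(PE040)
gen seceduc  = (educ3 | educ4 | educ5) if !missing(PE040)
gen tereduc  = educ6 if !missing(PE040)
list PE040 primeduc seceduc tereduc
```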

Stata: Combine table command with ttest and output latex

For regression output, I usually use a combination of eststo to store estimations, estadd to add the R2 and additional tests, then esttab to output the lot.
I need to do the same with the table command. I need the mean, median and N for a variable across three by variables and would like to add stars for the result of a ttest==1 on the mean and signtest==1 on the median. I have three by variables, so I've been using table to collate the mean, median and N, which I'm calling like the following pseudo-code:
sysuse auto,clear
table foreign rep78 , ///
contents(mean price median price n price) format(%9.2f)
ttest price==1, by(foreign rep78)
signtest price=1, by(foreign rep78)
I've tried esttab and estpost to no avail. I've also looked at tabstat, tablemat and summarize as alternatives to table, but they don't allow three by variables.
How can I create this table, add the stars for the ttest and signtest p-values and output the full table?
The main point in your question seems to be producing a LaTeX table. However, you show "pseudo-code", that looks pretty much like Stata code, with the caveat that it is illegal.
In particular, for the ttest you can only have one variable in the by() option. But notice that ttest allows also the by: prefix (you can use both, in fact). Their reasons-to-be are different. On the other hand, signtest does not allow a by() option but it does allow the by: prefix. So you should probably clarify what you want to do before creating the table.
If you are trying to use the by: prefix in both cases and afterwards produce a table, you can create a grouping variable, and put the commands in a loop. In this way, you can try tabulating the saved results for each group using the ESTOUT module (by Ben Jann in SSC). Something like:
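set more off
sysuse auto, clear
keep price foreign rep78
* create group variable (observations with missing rep78 get a missing group)
egen grou = group(foreign rep78)
* number of groups, so that the loop limit is not hard-coded
summarize grou, meanonly
local ngroups = r(max)
* tests by group
forvalues i = 1/`ngroups' {
    ttest price == 1 if grou == `i'
    signtest price = 1 if grou == `i'
    *<complete with estout syntax>
}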
See help by, help egen (the group function), help estout and help saved results.
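If the estout route proves awkward, a plainer alternative is to store the P-values in a variable and reduce the data to one row per group. The following is a sketch only: it handles just the t test, which leaves its two-sided P-value in r(p), and the star cutoffs (0.01, 0.05, 0.10) and variable names are my own assumptions, not anything from the question.

```stata
sysuse auto, clear
keep price foreign rep78
egen grou = group(foreign rep78)
gen p_ttest = .
summarize grou, meanonly
local ngroups = r(max)
forvalues i = 1/`ngroups' {
    quietly ttest price == 1 if grou == `i'
    * keep the two-sided P-value for this group
    quietly replace p_ttest = r(p) if grou == `i'
}
* one row per group with mean, median, n and the P-value
drop if missing(grou)
collapse (mean) mean=price (median) median=price (count) n=price (first) p_ttest, by(foreign rep78)
gen str3 stars = cond(p_ttest < 0.01, "***", cond(p_ttest < 0.05, "**", cond(p_ttest < 0.10, "*", "")))
list, sepby(foreign)
```

The resulting dataset can then be passed to whichever LaTeX export command you prefer.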