Summarise/ descriptive statistics for categorical data - stata

I have a dataset in Stata and would like to create a descriptive statistics table. The problem is that my variables are both numerical and categorical. For the numerical variables, I know I can easily create a table with the mean, standard deviation and so on. My problem is with the categorical variables. For example, education has 5 levels, and I would like to show the proportion of observations at each level. This is just part of it: I want to create an overall table that has descriptive statistics for other variables too, like gender, age, income, level of education and so on.

I like to use the user-contributed command table1 for this purpose. Type ssc install table1 to access the package.
sysuse auto
table1, vars(price contn \ rep78 cat)
+------------------------------------------------+
| Factor               Level   Value             |
|------------------------------------------------|
| N                            74                |
|------------------------------------------------|
| Price, mean (SD)             6,165.3 (2,949.5) |
|------------------------------------------------|
| Repair record 1978   1       2 (3%)            |
|                      2       8 (12%)           |
|                      3       30 (43%)          |
|                      4       18 (26%)          |
|                      5       11 (16%)          |
+------------------------------------------------+
Type help table1 for additional options.
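If installing a user-written package is not an option, the built-in commands cover the same ground, one variable at a time. This is only a minimal sketch on the same auto data: it does not produce a single combined table, and the choice of price and rep78 simply mirrors the example above.

```stata
sysuse auto, clear
* continuous variable: N, mean, SD, min, max
summarize price
* categorical variable: frequency and percent of each level
tabulate rep78
* same, but counting missing values as their own category
tabulate rep78, missing
```

Note that tabulate computes percentages over the non-missing observations unless the missing option is given.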

asdocx has a comprehensive template for creating table1. The template can summarize different types of variables such as continuous and binary / categorical variables in a single table. Table1 template allows different statistics with categorical / factor variables, continuous variables, and binary variables. The allowed statistics are given below:
mean         Mean of the variable
sd           Standard deviation of the variable
ci           95% confidence interval
n            Counts
N            Counts
frequency    Counts
percentage   Count as percentage of total *
%            Count as percentage of total
The statistics presented in the above table can be selectively used with categorical, binary, and continuous variables. The default statistics for each type of variables are given below:
(1) Binary variables : Count (Percentages)
(2) Categorical variables : Count (Percentages)
(3) Continuous variables : Mean (95% confidence interval)
The Table1 template also supports survey weights. I have posted several examples on this page.

Skewness in Stata

I have tried many different combinations of sktest and sadly nothing works.
I was almost certain that sktest would work in combination with by, but it doesn't.
The issue is this: I have a binary variable gender (male 0 and female 1) and I want to measure the skewness of the variable returns separately for males and females. Can you please advise?
I was hoping for a result similar to what we get when we run e.g. by gender: summarize returns
Different questions are bundled together here.
Testing
If you want to run sktest for different groups, you can just repeat the command
sysuse auto, clear
sktest price if foreign == 1
sktest price if foreign == 0
or write your own wrapper program to do the same. sktest in essence shows P-values but no summary measures.
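A wrapper of that kind can be as short as a loop over the group levels. The following is only a sketch, using levelsof to collect the values of foreign; the macro names groups and g are arbitrary.

```stata
sysuse auto, clear
* collect the distinct values of the grouping variable
levelsof foreign, local(groups)
foreach g of local groups {
    * print a header, then run the test on that group only
    display as text _newline "foreign == `g'"
    sktest price if foreign == `g'
}
```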
Or do something like this:
preserve
statsby , by(foreign) : sktest price
list
restore
Measuring
If you want to see (moment-based) skewness measures, you can just repeat summarize
bysort foreign: summarize price, detail
A wrapper is already available on SSC that is more selective.
moments price, by(foreign)
----------------------------------------------------------------------
    Group |          n        mean          SD    skewness    kurtosis
----------+-----------------------------------------------------------
 Domestic |         52    6072.423    3097.104       1.778       5.090
  Foreign |         22    6384.682    2621.915       1.215       3.555
----------------------------------------------------------------------
Warnings
Stata uses one estimator for moment-based skewness. There are others.
There are many ways to measure skewness. Those others mentioned in Section 7 of this paper are not a complete list; perhaps the most important omission is L-skewness (see lmoments from SSC).
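To make the first warning concrete: to my knowledge, the skewness reported by summarize, detail is the ratio m3 / m2^(3/2), where m2 and m3 are the second and third central moments computed with divisor n (not n - 1). The sketch below checks this by hand on the auto data; the variable names dev2 and dev3 and the macro names are made up for the illustration.

```stata
sysuse auto, clear
* skewness as reported by summarize
quietly summarize price, detail
local sk   = r(skewness)
local xbar = r(mean)
* squared and cubed deviations from the mean
generate double dev2 = (price - `xbar')^2
generate double dev3 = (price - `xbar')^3
* central moments with divisor n are just the means of the deviations
quietly summarize dev2, meanonly
local m2 = r(mean)
quietly summarize dev3, meanonly
local m3 = r(mean)
display "by hand: " `m3' / `m2'^1.5 "   summarize: " `sk'
```

If the two numbers agree (up to rounding), any other skewness estimator you want to report has to be computed separately.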

DAX: How to correctly create a measure group from a range of dates?

I have a dataset more or less like this one:
DATE | VALUE
01/01/2011 | 100
02/01/2011 | 150
02/01/2011 | 550 --> Repeated on purpose
.
.
12/01/2016 | 320
Now I need a calculated measure that includes only the values within a date range. I tried many approaches without success; the only one I managed to get working is the following DAX expression:
consuntivo = CALCULATE(SUM(provadat[valori]);provadat[datazione]>=DATE(2015;01;01)&&provadat[datazione]<=DATE(2016;01;01))
but it generates the following:
So basically I need a DAX query with a distinct sum for each date. How can I achieve that?
Two methods.
In the Table visualization you can choose Sum as the summarize option for the column valori.
Or, using DAX, it's as simple as
consuntivo = SUM(provadat[valori])
You don't need to handle the date logic particularly because Power BI will handle it based on the context (data columns you used with the measure).
So basically what I was missing was to add filters.
xxx = Calculate(SUM(provadat[valori]);FILTER(VALUES(provadat);provadat[datazione] <= DATE(2017;01;01) && provadat[datazione] >= DATE(2016;01;01)))

How do I generate predicted counts from a negative binomial regression with a logged independent variable in Stata?

I have a set of data with a dependent variable that is a count, and several independent variables. My primary independent variable consists of large dollar values. If I divide the dollar values by 10,000 (to keep the coefficients manageable), the models (negative binomial and zero-inflated negative binomial) run in Stata and I can generate predicted counts with confidence intervals. However, theoretically it is more logical to take the natural log of this variable. When I do that, the models still run, but now the predicted counts only range between about 0.22 and 0.77. How do I fix this so the predicted counts are generated correctly?
Your question does not show any code or data. It's nearly impossible to know what is going wrong without these two ingredients. Your question reads as "I did some stuff to this other stuff with surprising results." In order to ask a good question, you should replicate your coding approach with a dataset that everyone has access to, like rod93.
Here's my attempt at that, which shows reasonably similar predictions with nbreg from both models:
webuse rod93, clear
/* Model 1: exposure rescaled by 10,000 */
replace exposure = exposure/10000
nbreg deaths exposure age_mos, nolog
margins
predictnl d1 = predict(n), ci(lb1 ub1)
/* Compare the prediction for the first obs by hand */
di exp(_b[_cons] + _b[age_mos]*age_mos[1] + _b[exposure]*exposure[1])
di d1[1]
/* Model 2: natural log of the (rescaled) exposure */
gen ln_exp = ln(exposure)
nbreg deaths ln_exp age_mos, nolog
margins
predictnl d2 = predict(n), ci(lb2 ub2)
/* Compare the prediction for the first obs by hand */
di exp(_b[_cons] + _b[age_mos]*age_mos[1] + _b[ln_exp]*ln_exp[1])
di d2[1]
/* Side-by-side summary of predictions and confidence bounds */
sum d? lb* ub*, sep(2)
This produces very similar predictions and confidence intervals:
. sum d? lb* ub*, sep(2)
    Variable |        Obs        Mean    Std. Dev.       Min        Max
-------------+---------------------------------------------------------
          d1 |         21    84.82903    25.44322   12.95853   104.1868
          d2 |         21     85.0432    25.24095   32.87827   105.1733
-------------+---------------------------------------------------------
         lb1 |         21    64.17752    23.19418   1.895858   80.72885
         lb2 |         21    59.80346    22.01917    10.9009   79.71531
-------------+---------------------------------------------------------
         ub1 |         21    105.4805    29.39726   24.02121   152.7676
         ub2 |         21    110.2829    29.16468   51.76427    143.856

Regression in Stata by industry: How to get categories in a variable as the title for the resulting regression output?

I have two variables in my firm-level dataset containing the industrial classification and the industry name to which that company belongs. For a given id_class, industry_name might be missing in the data. See below
| id_class | industry_name |
|----------|-------------------|
| 10 | auto |
| 11 | telecommunication |
| 12 | . |
I'm doing regressions by industry, using the levelsof command to save each category in id_class to a local macro so that I can loop through the categories:
levelsof id_class, local(id_class_list)
foreach i of local id_class_list {
    reg y x if id_class == `i'
}
I want to save the estimated coefficients for each regression to a table (I know how to do this part), but I want the table to have the title contained in the industry_name variable. How can I do this?
You can use the macro extended functions for extracting data attributes, such as variable value labels, like this:
sysuse auto
levelsof foreign, local(list)
foreach v of local list {
    * look up the value label attached to foreign for this level
    local vl: label (foreign) `v'
    di "Car Origin is `vl':"
    reg price if foreign==`v'
}
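In the question the industry name is held in a separate string variable rather than in a value label. A possible adaptation, sketched here on the auto data, is to read the name with levelsof inside the loop. The variable origin_name is a made-up stand-in for industry_name, created with decode only so that the example runs; if the name is missing for a group, the title is simply left empty.

```stata
sysuse auto, clear
* string stand-in for industry_name (hypothetical, for illustration only)
decode foreign, generate(origin_name)
levelsof foreign, local(ids)
foreach i of local ids {
    * fetch the name that goes with this group; clean drops the quotes
    quietly levelsof origin_name if foreign == `i', local(name) clean
    display as text _newline "Group `i': `name'"
    regress price weight if foreign == `i'
}
```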
Have a look at statsby: check out its help and manual entry.
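As a rough sketch of what statsby does here, the following collects the coefficients and standard errors of one regression per group into a dataset with one row per group. The model (price on weight) is just an example, and preserve/restore is used so that the original data come back afterwards.

```stata
sysuse auto, clear
preserve
* one regression per level of foreign; results replace the data in memory
statsby _b _se, by(foreign) clear: regress price weight
list, abbreviate(20)
restore
```

The resulting dataset keeps the by variable, so the group identifier travels with the saved coefficients and can be used to label the table.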

Sort observations in a custom order

I have a dataset that results from the joins between a few results from a proc univariate.
After some more joins, I have a final dataset with a variable called "Measure", which has the name of certain measures, like 'mean' and 'standard deviation', for example, and other variables each with values for these measures, representing a month in a certain year.
I'd like to sort these measures in a particular order and, for now, I'm doing a proc transpose, using a retain to establish the order I want, and then doing another transpose. The problem is that this is a really naive solution and I feel it takes longer than it should.
Is there a simpler/more effective way to do this sort?
An example of what I want to do, with random values:
What I have:
Measures | 2013/01 | 2013/02 | 2013/03
Mean | 10 | 9 | 11
Std Devi.| 1 | 1 | 1
Median | 3 | 5 | 4
What I want:
Measures | 2013/01 | 2013/02 | 2013/03
Std Devi.| 1 | 1 | 1
Median | 3 | 5 | 4
Mean | 10 | 9 | 11
I hope I was clear enough.
Thanks in advance
A couple of straightforward solutions. First, you could simply add a variable that you sort by and then drop. You don't need to transpose; just do it in the data step or PROC SQL after the join: if measures='Mean' then sortorder=3; else if measures='Median' then sortorder=2; ... Then sort by sortorder and drop it in the PROC SORT step.
Second, if you're using entirely numeric values, you can use PROC MEANS to do the sorting for you, with a custom format that defines the order (using NOTSORTED and order=data on the class statement) and idgroup functionality in PROC MEANS to do the sorting and output the right values. This is overkill in most cases, but if the dataset is huge it might be appropriate.
Third, if you're doing the joins in SQL, you can order by a variable that you define in the order you want. I can explain that in more detail if you find that the most useful.