Since I am a Stata beginner, I don't know how to import my data, but I just wanted some help to understand my mistake.
I would like to create a table where the first column displays the mean characteritics of California, the second column the mean characteritics of all other states (unweighted by their population), and the last column reports the p-value of the differences in means. The variable California is a dummy variable where 1 equals to California and 0 equals to 38 others states.
Here is the result :
estpost ttest cigsale lnincome beer age15to24 retprice, by(California)
esttab using ta.rtf, cell("mu_1 mu_2 p") label ///
title(Mean characteristics of California and 38 control states) ///
collabels(none) unstack noobs nonumber mtitles("California" "2" "P value") ///
addnote("All variables are averaged from 1980 to 1988. Beer consumption is averaged from 1984 to 1988. GDP per capita is measured in 1997 dollars, retail prices are measured in cents, beer consumption is measured in gallons, and cigarette sales are measured in packs")
Result
The problem is that when I put mtitles ("California" "Average of 38 control states" "P value"), it only works for the column California as you can see in the picture. The two other columns do not have the titles "Average of 38 control states" and "P value". I have tried many commands but there is only a title for the California column.
Could you please help me figure out what is going on?
This is one model with three columns, so instead of mtitles you can use collabels.
collabels("California" "2" "P value")
Related
I have tried many different combinations of sktest and sadly nothing works.
I was almost certain that sktest will work with by combination but it doesn't.
The issue is: I have binary data gender (male 0 and female 1) and I want to measure the skewness of returns for each (male and female) in the variable returns. Can you please advise?
I was hoping for a result similar to what we get when we run e.g. by gender: summarize returns
Different questions are bundled together here.
Testing
If you want to run sktest for different groups, you can just repeat the command
sysuse auto, clear
sktest price if foreign == 1
sktest price if foreign == 0
or write your own wrapper program to do the same. sktest in essence shows P-values but no summary measures.
Or do something like this:
preserve
statsby , by(foreign) : sktest price
list
restore
Measuring
If you want to see (moment-based) skewmess measures, you can just repeat summarize
bysort foreign: summarize price, detail
A wrapper is already available on SSC that is more selective.
moments price, by(foreign)
----------------------------------------------------------------------
Group | n mean SD skewness kurtosis
----------+-----------------------------------------------------------
Domestic | 52 6072.423 3097.104 1.778 5.090
Foreign | 22 6384.682 2621.915 1.215 3.555
----------------------------------------------------------------------
.
Warnings
Stata uses one estimator for moment-based skewness. There are others.
There are many ways to measure skewness. Those others mentioned in Section 7 of this paper are not a complete list; perhaps the most important omission is L-skewness (see lmoments from SSC).
Objective:
Looking to generate a column measure (DYNAMIC BALANCE) which reads the first row of BALANCE, and calculate the balance from subsequent rows using AMOUNT column.
Problems encountered:
Measure should ignore the AMOUNT on the first row and only capture the BALANCE into a variable.
My first row formula grabs both values for 1/1/2021 rather than only the top most row balance value.
Date
Description
AMOUNT
BALANCE
1/1/2021
Dog Food
-20.00
980.00
1/1/2021
McDonalds
-30.00
950.00
1/5/2021
Pay Day
1000.00
1950.00
1/8/2021
Dog Food
-20.00
1930.00
1/10/2021
Medical
-1000.00
930.00
1/18/2021
McDonalds
-30.00
900.00
1/21/2021
Pay Day
1000.00
1900.00
1/31/2021
Dog Food
-50.00
1850.00
2/2/2021
McDonalds
-40.00
1810.00
You are my hero!
Here is the solution. Funny how the solution is so simple in DAX and in hindsight. M-Code would have been more complicated. DAX is dynamic to the filter which is necessary.
New Measure:
RunningTotal = FIRSTNONBLANK(bankdownload[Balance],0)+SUM(bankdownload[Amount])-FIRSTNONBLANK(bankdownload[Amount],0)
I have some data with two different groupd of patients automatically exported from a diagnostic tool.
Variables are automatically nominated by the diagnostic tool (e.g. L1DensityWholeImage, L1WholeImageSHemi, L1WholeImageIHemi , L1WholeETDRS ,[...], DeepL2StartLayer, L2Startoffsetum, L2EndLayer, [...], Perimeter, AcircularityIndex )
I have to perform a Rank-sum test (or Mann-Whitney U test) with all the variables (> of 80) by group.
Normally, I should write each single analysis like that:
ranksum L1DensityWholeImage, by(Group)
ranksum L1WholeImageSHemi, by(Group)
ranksum L1WholeImageIHemi, by(Group)
ranksum L1WholeETDRS, by(Group)
Is there any way or code to write the command with a varlist? And maybe to obtain only 1 output result with all the p value?
e.g.: ranksum L1DensityWholeImage L1WholeImageSHemi L1WholeImageIHemi L1WholeETDRS, DeepL2StartLayer L2Startoffsetum L2EndLayer Perimeter AcircularityIndex, by(Group)
A short answer is write a loop and customise output.
Here is a token example which you can run.
sysuse auto, clear
foreach v of var mpg price weight length displacement {
quietly ranksum `v', by(foreign) porder
scalar pval = 2*normprob(-abs(r(z)))
di "`v'{col 14}" %05.3f pval " " %6.4e pval " " %05.3f r(porder)
}
Output is
mpg 0.002 1.9e-03 0.271
price 0.298 3.0e-01 0.423
weight 0.000 3.8e-07 0.875
length 0.000 9.4e-07 0.862
displacement 0.000 1.1e-08 0.921
Notes:
If your variable names are longer, they will need more space.
Displaying P-values with fixed numbers of decimal places won't prepare you for the circumstance in which all displayed digits are zero. The code exemplifies two forms of output.
The probability that values for the first group exceed those for the second group is very helpful in interpretation. Further summary statistics could be added.
Naturally a presentable table needs more header lines, best given with display.
I am exploring an effect that I think will vary by GDP levels, from a data set that has, vertically, country and year (1960 to 2015), so each country label is on 55 rows. I ran
sort year
by year: egen yrank = xtile(rgdp), nquantiles(4)
which tags every year row with what quartile of GDP they were in that year. I want to run this:
xtreg fiveyearg taxratio if yrank == 1 & year==1960
which would regress my variable (tax ratio) against some averaged gdp data from countries that were in the bottom quartile of GDPs in 1960 alone. So even if later on they grew enough to change ranks, the later data would still be in the regression pool. Sadly, I cannot get this code, or any variation, to run.
My current approach is to try to generate some new variable that would give every row with country label X a value of 1 if they were in the bottom quartile in 1960, but I can't get that to work either. i have run out of ideas, so I thought I would ask!
Based on your latest comment, which describes the (un)expected behavior:
clear
set more off
*----- example data -----
input ///
country year rank
1 1960 2
1 1961 1
1 1962 2
2 1960 1
2 1961 1
2 1962 1
3 1960 3
3 1961 3
3 1962 3
end
list, sepby(country)
*----- what you want -----
// tag countries whose first observation for -rank- is 1
// (I assume the first observation for -year- is always 1960)
bysort country : gen toreg = rank[1] == 1
list, sepby(country)
// run regression conditional on -toreg-
xtreg ... if toreg
Check help subscripting if in doubt.
I'm pretty new to Stata.
I have a set of observations of the form "Country GDP Year". I want to create a new variable GDP1960, which gives the GDP in 1960 of each country for each year:
USA $100m 1960 USA $100m 1960 $100m
USA $200m 1965 --> USA $200m 1965 $100m
Canada $60m 1960 Canada $60m 1960 $60m
What's the right syntax to make this happen? (I assume egen is involved in some mysterious way)
You've found a solution with cond(), but here's a couple of suggestions that might make modeling your data easier and help you avoid problems with issues that might arise when sorting by creating your rank variable (and I've got the egen solution that you asked about below):
Paste the code below into your do-file editor and run it:
*---------------------------------BEGIN EXAMPLE
clear
inp str20 country str10 gdp year
"USA" "$100m" 1960
"USA" "$200m" 1965
"Canada" "$60m" 1960
"Canada" "$120m" 1965
"USA" "$250m" 1970
"Mexico" "$90m" 1970
"Canada" "$800m" 1970
"Mexico" "$160m" 1960
"Mexico" "$220m" 1965
"Mexico" "$350m" 1975
end
//1. destring gdp so that we can work with it
destring gdp, ignore("$", "m") replace
//2. Create GDP for 1960 var:
bys country: g x = gdp if year==1960
bys country: egen gdp60 = max(x)
drop x
**you could also create balanced panels to see gaps in your data**
preserve
ssc install panels
panels country year
fillin country year
li //take a look at the results win. to see how filled panel data would look
restore
//3. create a gdp variable for each year (reshape the dataset)
drop gdp60
reshape wide gdp, i(country) j(year)
**much easier to use this format for modeling
su gdp1970
**here's a fake "outcome" or response variable to work with**
g outcome = 500+int((1000-500+1)*runiform())
anova outcome gdp1960-gdp1970 //or whatever makes sense for your situation
*---------------------------------END EXAMPLE
A one-line solution is
egen gdp60 = mean(gdp / (year == 1960)), by(country)
The trick here is the division by the expression year == 1960. This is true for 1960, in which case we divide by 1, which leaves the gdp for that year unchanged. It is false for all other years, in which case we divide by 0. That sounds crazy, but the consequence whenever we divide by zero is just missing values, which will be ignored by egen's mean() function.
You could use other egen functions, as in this case there should be at most one value for 1960 for each country, so e.g. max(), min(), total() should all work too. (If a country has no value for 1960, or a missing value, we will end up with missing, which is precisely as it should be.)
Discussion at http://www.stata-journal.com/article.html?article=dm0055
Well, I found a solution in the end. It relies on the fact that generate and replace work on the data in its sorted order, and that you can refer to the current observation with _n.
gen rank = 100
replace rank = 50 if year == 1960
gen gdp60 = .
sort country rank
replace gdp60 = cond(iso == iso[_n-1], gdp60[_n-1], gdp[_n])
drop rank
sort country year
EDIT: A more direct solution with the same flavour:
gen wanted = year == 1960
bysort country (wanted) : gen gdp60 = gdp[_N]
drop wanted
sort country year
Here wanted will be 1 for 1960 and 0 otherwise.
I can't think of anything shorter than these two lines:
gen temp = gdp if year == 1960
by country : egen gdp60 = max(temp)
If you want a variable for each year (e.g., gdp60, gdp61, gdp62,...) then you probably should use reshape