In Stata, is there a quick way to show the correlation between a variable and a bunch of dummies? In my data I have an independent variable, goals_scored in a game, and a bunch of dummies for the stadium played in. How can I show the correlations between goals_scored and i.stadium in one table, without getting the correlations between stadiums, which I do not care about?
Here's one way:
#delimit ;
quietly tab stadium, gen(D); // create dummies
foreach var of varlist D* {;
    quietly corr goals_scored `var';
    di as text "`: variable label `var'': " as result r(rho);
};
drop D*; // get rid of dummies
#delimit cr
cpcorr from SSC (install with ssc inst cpcorr) supports minimal correlation tables, i.e. only the correlations between one set and another set, without the others. But it's an old program (2001) and doesn't support factor variables directly. The indicator variables (a.k.a. dummy variables) would have to exist first.
If you store all of the stadium variables in a local, you would probably loop through them to pull the correlations.
1. If all stadium variables are placed next to each other in the dataset:
foreach s of varlist stadium1-stadium150 {
    // do whatever
}
2a. If the stadium variables are not next to each other, use order to get there.
2b. If the variable names follow a pattern, a wildcard in the varlist (for example stadium*) selects them without reordering.
3. I would not use correlation for this. Depending on the distribution of goals, I would consider something else.
Related
The example below reproduces my problem. There is a string variable which takes several values. I want to create a global list and iterate over it in a loop. But it does not work. I've tried several versions without success. Here is the example code:
webuse auto, clear
levelsof make // list of car makes
global MAKE r(levels) // save levels in global list
foreach i in $MAKE { // loop some command over saved list
sum if make == "`$MAKE'" // ERROR 198, invalid 'Concord'
}
Using "`$MAKE'" or $MAKE does not yield desired output.
Any idea what I am doing wrong?
Normally, for lists to work, they should be saved as in A B C D [...]. In my case, levelsof produces a list of the following kind:
di $MAKE
`"AMC Concord"' `"AMC Pacer"' `"AMC Spirit"' `"Audi 5000"' `"Audi Fox"' `"BMW 320i"' [...]
So clearly not what is needed. But not sure how to get what I need.
Here is a solution. Note that I am using a local instead of a global. The difference is only scope. Only use global if you need to reference the value across do-files. You can remove the display lines below.
*Sysuse reads this data from disk, it comes with all Stata installations
sysuse auto, clear
*Use levelsof, and assign the returned r(levels) using a = to the local
levelsof make
local all_makes = r(levels)
*Loop over the local like this. Note that foreach creates a local, in this
*case called this_make that stores the elements in the local one per iteration
foreach this_make of local all_makes {
display "`this_make'"
sum if make == "`this_make'"
}
If global is what you need, then you simply change it to this:
*Sysuse reads this data from disk, it comes with all Stata installations
sysuse auto, clear
*Use levelsof, and assign the returned r(levels) using a = to the global
levelsof make
global all_makes = r(levels)
*Loop over the global like this. Note that foreach creates a local, in this
*case called this_make that stores the elements in the global one per iteration
foreach this_make of global all_makes {
display "`this_make'"
sum if make == "`this_make'"
}
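Both versions above copy r(levels) with =, which evaluates the list as a string expression. As a hedged alternative, levelsof has a local() option that writes the list straight into a macro. The sketch below assumes the same auto dataset; summarizing price rather than every variable is only a choice made to keep the output short.

```stata
sysuse auto, clear
* local() stores the distinct values directly in the macro all_makes,
* so no separate assignment from r(levels) is needed
levelsof make, local(all_makes)
foreach this_make of local all_makes {
    display "`this_make'"
    * price is used here only to keep the output short
    summarize price if make == "`this_make'"
}
```

The loop itself is unchanged; only the way the list gets into the macro differs.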
There is a fine accepted answer but plenty more can be said. See for example this FAQ.
I am positive about levelsof as its original author, but for the purpose specified, to loop over the levels of a variable, it can be a lot cleaner to use egen, group() and loop over the integer levels of that variable. See the FAQ just linked for more. The example in the original question is a case in point, as looping over distinct string values can be tricky with a need to use double quotes " " and to watch out for spaces and so forth.
The underlying problem is not revealed but an extra comment is to underline that by: and its sibling commands such as statsby or commands similar in spirit such as rangestat from SSC offer, in effect, looping without looping.
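A minimal sketch of the egen, group() approach described above, again on the auto dataset. The variable name make_id is an assumption chosen for illustration, and price stands in for whatever is of interest.

```stata
sysuse auto, clear
* map each distinct make to an integer 1, 2, ...;
* the label option keeps the original strings as value labels
egen make_id = group(make), label
* the largest group number equals the number of distinct makes
summarize make_id, meanonly
forvalues i = 1/`r(max)' {
    * look up the value label for group `i' to show which make this is
    display "`: label (make_id) `i''"
    summarize price if make_id == `i'
}

* "looping without looping": by: repeats the command for each group
bysort make: summarize price
```

Looping over integers avoids the quoting issues that arise with string values containing spaces.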
I am running a panel data regression. I want to run a for loop that repeats the regression with one entity dropped each time, so that I can see, by comparing the coefficients, whether my results are driven by a particular entity or are consistent across the sample. I am currently using this for loop:
forvalues i = 1/19{
use "sample_seven.dta", clear
drop if countryid == i
xtscc ln_gdp tech population inflation tradebalance i.year, fe}
However, when I run the above code what I get is 19 regressions where only two observations have been dropped in each of them.
use "sample_seven.dta", clear
forvalues i = 1/19 {
    di _n "Omitting `i'"
    xtscc ln_gdp tech population inflation tradebalance i.year if countryid != `i', fe
}
The error in your code was not using single quotation marks to specify the local macro i. Your code worked, so a reference to i was interpreted as a reference to some variable, possibly inflation.
Beyond that, there is no need to read in the dataset repeatedly. But it would help to spell out which country is being omitted.
This relates to a general question I'm asking myself, how can I use the results of some code in another code if Stata does not create new objects except these clandestine locals and globals?
I would like to combine:
di c(k)
and:
expand
which in R I would simply do by writing something like expand(di c(k)). How does Stata take care of wrapped functions?
edit: I'm fine with using locals and globals but I don't always know how to call them into a function.
edit2: for everyone else who has trouble keeping track of 'clandestine' globals and locals: macro list
The difficulty you have in using locals, globals, scalars, saved results is not obvious from your question. An example is:
clear
set more off
sysuse auto
keep rep78
summarize
return list
expand r(max)
Saved results may disappear when other commands are issued, but you can save them into a local, for example, and use them later:
local rmax = r(max)
display `rmax'
expand `rmax'
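For the specific combination asked about, a hedged sketch: c(k), the number of variables in memory, is itself an expression, so it can be passed to expand directly, and `= ' evaluates an expression on the fly inside any command. The choice of three variables from the auto dataset is only for illustration.

```stata
sysuse auto, clear
keep price mpg rep78
* c(k) holds the number of variables currently in memory
display c(k)
* expand accepts an expression, so the system value can be used as is
expand c(k)
* `=exp' substitutes the evaluated result wherever a macro could go
display "number of variables: `=c(k)'"
```

No wrapping of commands is needed; display only shows a result, while c(k) carries the value.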
I am trying to execute the following code:
forval i = 1/51 {
// number of households
by hhid, sort: gen nvals = _n==1
count if (nvals & stateID == `i')
local stateTotalHH = r(N)
local avPersonHH`i' = sum(numper)/`stateTotalHH' if(nvals & stateID ==`i')
drop nvals
}
Everything works fine except if is not allowed with sum(). How can I estimate the total or the sum of all values in numper variable for each state and at household level?
ps:
I cannot use collapse numper, by(stateID) because I have other estimations
also, I cannot do the following: duplicates drop hhid, force
Your problem does not even call for sum() with if, so it is best to start at the beginning.
Reconstructing your problem, which is not well explained,
You have observations for individuals within households (identifier hhid) within 50 states of the USA and the District of Columbia (identifier stateID).
You have a variable numper, the number of persons per household, and you want the average per state.
Observations are repeated for each individual in a household, so it is necessary to use just one observation per household.
You can tag each household once by
egen tag = tag(hhid)
The average as a new variable would be
egen avPersonHH = mean(numper/tag), by(stateID)
Stata is going to average numper/tag which variously will be numper/1 and numper/0; the missings from the latter division will just be ignored, which is what is wanted.
That variable is repeated for each household. To see just one value for each stateID,
tabdisp stateID, cell(avPersonHH)
What is wrong with your code? Here is a partial list:
a. No loop is required.
b. If it were, the statement by hhid, sort: gen nvals = _n==1 should not be repeated.
c. sum() is a function for cumulative sums across observations, not what you want here.
d. The line
local avPersonHH`i' = sum(numper)/`stateTotalHH' if(nvals & stateID ==`i')
would at best calculate one number, but the if condition is misplaced. if whatever local ... often makes sense in Stata, but putting if on the right of a local definition only makes sense for manipulating text containing commands.
Your comment on this line misses these basic misconceptions, c. and d.
e. You were aiming to have collected 51 values of averages in as many local macros, but still need to put them somewhere useful.
f. Separate calculation of totals and numbers is not required, as you can get Stata to calculate the mean for you.
(LATER) This code plays along step by step with your aversion to using collapse and duplicates, the grounds for which are not stated. But most experienced Stata users would be happy to use brute force:
duplicates drop hhid, force
collapse numper, by(stateID)
and then merge back. That solution is not only direct, but also uses fewer idiosyncratic Stata details, which can take time to figure out.
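A sketch of that brute-force route, including the merge back, on a tiny made-up dataset. The input values and the names state_means and avPersonHH are assumptions for illustration only; preserve and restore keep the original person-level data intact.

```stata
clear
* hypothetical person-level data: one row per person
input hhid stateID numper
1 1 2
1 1 2
2 1 3
2 1 3
2 1 3
3 2 1
end

preserve
* keep one observation per household
duplicates drop hhid, force
* mean household size per state
collapse (mean) avPersonHH = numper, by(stateID)
tempfile state_means
save `state_means'
restore

* attach the state means to every person-level observation
merge m:1 stateID using `state_means', nogenerate
list
```

Because tempfile defines a local macro, the whole block should be run together from a do-file rather than line by line.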
I have done this many times with Excel and Java... This time I need to do it using Stata because it is more convenient to preserve variables' labels. How can I restructure dataset_1 into dataset_2 below?
I need to transform the following dataset_1:
into dataset_2:
I know one way, which is a little awkward... I mean, I could expand all the observations, then create variable obsNo, and then, rename variables...is there any better way?
Stata is wonderful at this sort of thing; it's a simple reshape. Your data is a little awkward, as the reshape command was designed to work with variables where the common part of the variable name (in your case, Wage) comes first. In the documentation for reshape, "Wage" would be the stub. In this approach, the part following Wage is numeric. If you first rename your variables by
rename (raceWhiteWage raceBlackWage raceAsianWage) (Wage1 Wage2 Wage3)
Then you can do:
reshape long Wage, i(state year) j(race)
That should give you the output you are looking for. You will have a column named "race", with values of 1 for White, 2 for Black, and 3 for Asian.
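To make the numeric race codes readable after the reshape, value labels can be attached. The sketch below runs the whole sequence on made-up data; the input values and the label name racelbl are assumptions for illustration.

```stata
clear
* hypothetical wide data: one wage variable per race
input str2 state year raceWhiteWage raceBlackWage raceAsianWage
"CA" 2000 10 9 11
"CA" 2001 11 10 12
"NY" 2000 12 11 13
end

rename (raceWhiteWage raceBlackWage raceAsianWage) (Wage1 Wage2 Wage3)
reshape long Wage, i(state year) j(race)

* attach readable labels to the numeric codes created by reshape
label define racelbl 1 "White" 2 "Black" 3 "Asian"
label values race racelbl
list, sepby(state year)
```

The underlying values of race stay numeric, so conditions such as if race == 2 still work.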