I have done this many times with Excel and Java... This time I need to do it using Stata because it is more convenient to preserve variables' labels. How can I restructure dataset_1 into dataset_2 below?
I need to transform the following dataset_1:
into dataset_2:
I know one way, which is a little awkward... I mean, I could expand all the observations, then create a variable obsNo, and then rename variables... Is there any better way?
Stata is wonderful at this sort of thing; it's a simple reshape. Your data are a little awkward, as the reshape command was designed to work with variables where the common part of the variable name (in your case, Wage) comes first. In the documentation for reshape, "Wage" would be the stub, and the part following the stub is numeric by default. If you first rename your variables:
rename (raceWhiteWage raceBlackWage raceAsianWage) (Wage1 Wage2 Wage3)
Then you can do:
reshape long Wage, i(state year) j(race)
That should give you the output you are looking for. You will have a variable named race, with values of 1 for White, 2 for Black, and 3 for Asian.
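For completeness, here is a sketch of an alternative that skips the renaming step, assuming the variables really are named raceWhiteWage, raceBlackWage, and raceAsianWage: reshape accepts an @ in the stub to mark where the varying part sits, and the string option lets j hold that text directly.
reshape long race@Wage, i(state year) j(race) string
This should leave a string variable race taking the values White, Black, and Asian, and a numeric variable raceWage, with no numeric codes or value labels needed.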
In Stata, is there a way to redirect the data that a command produces into a table instead of a graph?
Example: if someone created a normal probability plot with the pnorm var_name command, is there a way to redirect the data so that instead of appearing in a graph, it appears in a table?
To add to #Noobie's answer:
Different commands work in different ways. There's no better short summary.
What you can look out for includes
generate() options that produce new variables. (There is no absolute rule that the options have this name, but that or a similar name is the most common single variety.)
Options that allow saving results to new datasets.
Saved results, especially those visible after return list or ereturn list. These can be quite elaborate, e.g. saving of matrices of counts after tabulate.
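To illustrate that last point, here is a minimal sketch (using the auto dataset shipped with Stata) of pulling saved results and a matrix of counts out of commands instead of reading them off the display:
sysuse auto, clear
quietly summarize price
return list                              // r(N), r(mean), r(sd), ...
quietly tabulate foreign rep78, matcell(C)
matrix list C                            // the cell counts, saved as a matrix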
More broadly, Stata commands aren't functions! One characteristic of a function, as so named in many languages or programs, is that there is a result, with special cases where the result is void or null. There clearly are statistical programs which in broad terms hinge on calling functions that have results, and what you see displayed is often a side-effect of that. Stata commands don't work like that: what a program leaves behind can take various forms, and for commands designed just to show something, the "result" may simply be a display. It's worth noting that Mata, which underpins Stata, is more recognisably a C-like language, with (e.g.) many matrix extensions, and is based on functions (and much else).
Yes and no. It really depends on the command you are using. You should look at the help files first.
For instance, pnorm does not allow that. You can create the data yourself using the formula for pnorm described in the help file, where the cumulative distribution at some point is plotted against the so-called plotting position.
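As a rough sketch of that do-it-yourself route, using the auto data: the exact plotting-position convention should be checked against help pnorm; the one below is just a common choice used for illustration.
sysuse auto, clear
quietly summarize price
local m = r(mean)
local s = r(sd)
sort price
generate Phi = normal((price - `m')/`s')    // theoretical normal probability for each value
generate pp  = _n/(_N + 1)                  // plotting position (one common convention)
list price pp Phi in 1/5                    // the numbers behind a pnorm-style plot, as a table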
Other Stata commands allow you to generate the points directly. This is the case for kdensity for instance.
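For example, a minimal sketch with kdensity, whose generate() and nograph options are documented in its help file:
sysuse auto, clear
kdensity price, generate(gridx dens) nograph   // writes evaluation points and density estimates as variables
list gridx dens in 1/5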
I have a dataset of membership information, and I want to keep only the people who have been continuously enrolled for the entire year. There are 12 variables for each person, one for each month of the year with how many days during that month they were enrolled. Is there a way to make a subset of the data for just those with a value >1 for each of the month variables?
Thanks!
SAS has various summary functions that might well be what you're looking for. See min() (minimum) in particular, as it will allow you to find the minimum of several variables. You may also want to consider nmiss() (number of missing values) and n() (number of non-missing values) if you have to deal with missing values in your data.
Summary functions can be passed lists of variables like this (in a data step):
minimum = min(var1, var2, var3);
However, that can become long winded if you need to use a lot of variables. Fortunately, SAS provides several ways to reference lists of variables to make things neater. You can read about these variable lists here. To use them in a summary function use the of qualifier:
minimum = min(of var1-var12);
maximum = max(of var:);
blanks = nmiss(of _NUMERIC_);
Finally, you will want to use your newfound data to decide which data to include. To do this in a data step, look at the output statement (user guide):
if min(of var:) > 1 then output;
Or if you feel like learning a bit more about SAS's syntax you could try using an implicit output by reading through the last link.
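Putting the pieces together, a minimal sketch of a data step that keeps only the continuously enrolled people, using the implicit output just mentioned (the dataset name members and the variable names month1-month12 are made up here):
data enrolled_all_year;
    set members;                       /* hypothetical input dataset */
    if min(of month1-month12) > 1;     /* subsetting IF: keep a row only if every month exceeds 1 */
run;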
In general it's preferred to ask specific questions and show your current work on SO, and I'd advise using google to answer your basic questions while you're learning the fundamentals. There is plenty of great documentation available to help you out there.
totalSUPPLY= sum(of supply1-supply485);
I've got this simple calculation to make (in SAS) from a table that I've transposed (hence the variable names). I have to do this several times, and the number of supply variables is not the same for each calculation. I.e. in the above example it's 485, but I do it later in my analysis and it's 350.
My question: is there a way to 'wildcard' the number of 'supply' columns? Basically, I want something like this (but this doesn't work): totalSUPPLY = sum(of supply1-supply%);
Also: if there is an easier way to do the same, I'm open to (and would actually prefer) that.
Thanks everyone!
data yoursummary;
    set yourdata;                         /* dataset containing supply1-supply485 */
    array supplies{*} supply:;            /* array over every variable whose name starts with "supply" */
    totalSUPPLY = sum(of supplies{*});    /* add up all elements of the array for this row */
run;
N.B. using a : wildcard like this will only pick up matching variables that are present in the PDV (program data vector) at the point when you create the array, so the array definition has to come after the set statement. Also, it only works for variables with a common prefix, not those with a common suffix.
As Joe has pointed out, the following more concise code also works:
data yoursummary;
    set yourdata;                         /* dataset containing supply1-supply485 */
    totalSUPPLY = sum(of supply:);        /* the : wildcard can be used directly inside sum(of ...) */
run;
Of course, if you declare an array it's then easier to do related things like checking how many variables are being added together, or looping through the variables in the array and applying the same logic to each one in turn.
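For instance, a sketch of the kind of extra bookkeeping an array makes easy (same made-up dataset and variable names as above):
data yoursummary;
    set yourdata;                              /* dataset containing supply1-supply485 */
    array supplies{*} supply:;
    n_supply = dim(supplies);                  /* how many supply variables were picked up */
    totalSUPPLY = 0;
    do i = 1 to dim(supplies);
        totalSUPPLY + coalesce(supplies{i}, 0);   /* treat missing as zero, matching how sum() skips them */
    end;
    drop i;
run;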
I would like to wrap large matrix-style output, as per the corr command's wrap option, in an ado-file. Unfortunately, corr is not implemented as an ado-file, and pwcorr, which is implemented as an ado-file that I might model my approach on, is missing the wrap option.
It would be useful to understand both how to do this for a fixed output width, and also how to base the width on the current window settings.
I use matlist for displaying most of my matrices, and it will also take care of wrapping.
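A minimal sketch, using the auto data, of feeding a correlation matrix to matlist (the format() choice here is just illustrative):
sysuse auto, clear
quietly correlate price mpg weight length displacement turn trunk
matlist r(C), format(%7.3f)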
More complicated output can be handled by being creative when making the matrix. I find the dotz option of matlist often helpful in that respect. Another trick I use a lot is the fact that a row or column name can have two parts: an "equation name", which can be shared by multiple rows or columns, and a "row or column name within the equation".
If you need more flexibility you can look at frmttable.
In Stata, is there a quick way to show the correlation between a variable and a bunch of dummies? In my data I have an independent variable, goals_scored in a game, and a bunch of dummies for the stadium played in. How can I show the correlation between goals_scored and i.stadium in one table, without getting the correlations between the stadiums, which I do not care about?
Here's one way:
#delimit ;
quietly tab stadium, gen(D) ;    // create dummies
foreach var of varlist D* { ;
    quietly corr goals_scored `var' ;
    di as text "`: variable label `var'': " as result r(rho) ;
} ;
drop D* ;    // get rid of dummies
#delimit cr
cpcorr from SSC (install with ssc inst cpcorr) supports minimal correlation tables, i.e. only the correlations between one set and another set, without the others. But it's an old program (2001) and doesn't support factor variables directly. The indicator variables (a.k.a. dummy variables) would have to exist first.
If you store all of the stadium variables in a local, you would probably loop through them to pull the correlations.
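A sketch of what that loop might look like (the variable names are illustrative):
unab stadiums : stadium*                 // expand the wildcard into a local macro
foreach v of local stadiums {
    quietly correlate goals_scored `v'
    display as text "`v': " as result %6.3f r(rho)
}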
1. If all stadium variables are placed next to each other in the dataset:
foreach s of varlist stadium1-stadium150 {
    // do whatever
}
2a. If the stadium variables are not next to each other, use order to get them there.
2b. If the variable names follow a pattern, there might be another workaround.
3. I would not use correlation for this. Depending on the distribution of goals, I would consider something else.