Keep other variables when executing get_dummies in Pandas - python-2.7

I have a DataFrame with an ID variable and another categorical variable. I want to create dummy variables out of the categorical variable with get_dummies.
dum = pd.get_dummies(df)
However, this makes the ID variable disappear. And I need this ID variable later on to merge to other data sets.
Is there a way to keep the other variables? I could not find anything in the documentation of get_dummies. Thanks!

You can also copy the original column into a new one before executing get_dummies. E.g.,
df['dum_orig'] = df['dum']
df = pd.get_dummies(df, columns=['dum'])
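A minimal sketch of the copy-then-encode approach, assuming a hypothetical frame with a string ID column and a categorical column named dum (the sample values are invented for illustration). Passing columns= restricts the encoding to the listed column, so ID and the copied column pass through unchanged.

```python
import pandas as pd

# Hypothetical sample data; column names and values are illustrative only.
df = pd.DataFrame({"ID": ["a1", "a2", "a3"], "dum": ["x", "y", "x"]})

# Keep a copy of the categorical column before it is replaced by dummies.
df["dum_orig"] = df["dum"]

# Encode only 'dum'; every other column is carried over untouched.
out = pd.get_dummies(df, columns=["dum"])

print(list(out.columns))
```

Without columns=, get_dummies encodes every string column it finds, which is one plausible reason a string ID would seem to vanish.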

I found the answer. You can concatenate the dummies data set to the original data set as shown below, as long as you don't re-order the data in the meantime.
df = pd.concat([df, dum], axis=1)
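A self-contained sketch of the concatenation approach, assuming a hypothetical frame with a numeric ID and a categorical column named cat (names and values are invented for illustration). The dummies are built from the categorical column alone and then joined back column-wise.

```python
import pandas as pd

# Hypothetical sample data; column names and values are illustrative only.
df = pd.DataFrame({"ID": [101, 102, 103], "cat": ["x", "y", "x"]})

# Build dummies from the categorical column only.
dum = pd.get_dummies(df["cat"], prefix="cat")

# Join column-wise; pandas aligns the two frames on their index labels.
merged = pd.concat([df, dum], axis=1)

print(list(merged.columns))
```

Because concat with axis=1 aligns on the index, the two frames need to share the same index labels; resetting or otherwise changing the index on one of them in between would misalign the rows.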

Related

Iterate over list produced with levelsof

The example below reproduces my problem. There is a string variable which takes several values. I want to create a global list and iterate over it in a loop. But it does not work. I've tried several versions without success. Here is the example code:
webuse auto, clear
levelsof make // list of car makes
global MAKE r(levels) // save levels in global list
foreach i in $MAKE { // loop some command over saved list
sum if make == "`$MAKE'" // ERROR 198, invalid 'Concord'
}
Using "`$MAKE'" or $MAKE does not yield desired output.
Any idea what I am doing wrong?
Normally, for lists to work, they should be saved as in A B C D [...]. In my case, levelsof produces a list of the following kind:
di $MAKE
`"AMC Concord"' `"AMC Pacer"' `"AMC Spirit"' `"Audi 5000"' `"Audi Fox"' `"BMW 320i"' [...]
This is clearly not what is needed, but I am not sure how to get what I need.
Here is a solution. Note that I am using a local instead of a global. The difference is only scope. Only use global if you need to reference the value across do-files. You can remove the display lines below.
*Sysuse reads this data from disk, it comes with all Stata installations
sysuse auto, clear
*Use levelsof, and assign the returned r(levels) using a = to the local
levelsof make
local all_makes = r(levels)
*Loop over the local like this. Note that foreach creates a local, in this
*case called this_make that stores the elements in the local one per iteration
foreach this_make of local all_makes {
display "`this_make'"
sum if make == "`this_make'"
}
If global is what you need, then you simply change it to this:
*Sysuse reads this data from disk, it comes with all Stata installations
sysuse auto, clear
*Use levelsof, and assign the returned r(levels) using a = to the global
levelsof make
global all_makes = r(levels)
*Loop over the global like this. Note that foreach creates a local, in this
*case called this_make that stores the elements in the global one per iteration
foreach this_make of global all_makes {
display "`this_make'"
sum if make == "`this_make'"
}
There is a fine accepted answer but plenty more can be said. See for example this FAQ.
I am positive about levelsof as its original author, but for the purpose specified, to loop over the levels of a variable, it can be a lot cleaner to use egen, group() and loop over the integer levels of that variable. See the FAQ just linked for more. The example in the original question is a case in point, as looping over distinct string values can be tricky with a need to use double quotes " " and to watch out for spaces and so forth.
The underlying problem is not revealed but an extra comment is to underline that by: and its sibling commands such as statsby or commands similar in spirit such as rangestat from SSC offer, in effect, looping without looping.

Create Lag variable with three conditions

-I need a lead variable based on 3 conditions. IF variable RoaDLM has a number and IF the Co_ID is the same as the lag(co_id) and IF CEO = lag(ceo), I need a lead variable: Lead1
-I sort descending to create the lag variable
-Everything else should be '.'
-Here is my code:
data RoaReg;
set RoaReg;
by CO_ID descending fyear;
if RoaDlm ne 0 and Co_ID = lag(CO_ID) and ceo=ceo then
Lead1 = lag(ROA);
else if RoaDlm= 0 then
Lead1='.';
run;
-Anyway, this does not work. Thanks!
There are a couple of issues with your code.
Do not use the same data set name in the SET and DATA statements. This is a recipe for errors that are difficult to debug.
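LAG() should not be called conditionally: it returns the value from the last time the function executed, not from the previous observation. Call it on every observation and set the result to missing where necessary.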
data RoaReg2;
set RoaReg;
by CO_ID descending fyear;
Lead1 = lag(ROA);
if RoaDlm= 0 then call missing (lead1);
run;
This is the correct version of your code, or my best guess. Providing sample data would help for sure.
Based on what I understood, you need a lead variable based on a few conditions, two of which involve lagged values of variables.
As far as I know, SAS has no lead function; you can use PROC EXPAND for that purpose. You also did not say which variable you want a lead of, so I am assuming it is a variable named ROA.
So, here is my best guess/interpretation of what you want.
data RoaReg_lead;
merge RoaReg RoaReg(keep=ROA rename=(ROA=LeadROA) firstobs=2); /*merged the same table with only the ROA variable, and read the values from 2nd observation | can't use by variables in order to do so*/
Lag_co_id=lag(co_id); /*creating lagged values*/
Lag_ceo=lag(ceo);
/*conditions*/
if (RoaDLM ne . and RoaDLM>0) and co_id=Lag_co_id and ceo=Lag_ceo then
Lead1=LeadROA;
drop Lag_co_id Lag_ceo LeadROA; /*You can keep the vars to do a manual check*/
run;
Otherwise, providing a sample table of your data (have and want) would be very helpful.

Creating dummy for states with year effect

I have data on 18 states for 6 years (2009-2014). How can I create dummies that account for state and time effects simultaneously?
Without your data I have to assume a lot to answer this, but if I assume your state variable is a string and your year variable is numeric, then to create dummy variables for this I would put the two variables together and then encode them, like below:
tostring year, replace
gen state_year = state+year
encode state_year, gen(state_year_num)
and state_year_num is your indicator variable.
If you want a bunch of dummy variables you can add this line:
tabulate state_year_num, gen(dummy)
which will generate as many dummy variables as state-year pairs.

Stata : how to use variables as file name

I would like to use a variable (its value) as a file name. Any ideas? I'm using Stata 14.
Thanks a Lot in advance!
Per the comment from @toonice, please do give more details so I can better address your question.
However, you can use local macros to input into file names. Let's say you have a data set of a single variable x taking values of your filenames. You could loop through the data to save different files with the values of x. For example:
local N = _N
forvalues i = 1/`N' {
// The = evaluates the expression, so the macro holds the value of x in this observation
local myfilename = x[`i']
// Insert code that changes data in some way to make files different
save ../output/`myfilename'_staticfilename.dta, replace
}
Give me more context and I am happy to provide more help.

How to store a mean value in a local macro and then save it in another file?

I have a Stata file file1.dta and one of the variables is income. I need to calculate average_income, assign it to a local macro, and store in a different Stata file, New.dta.
I have tried the following in a do file:
#delimit;
clear;
set mem 700m;
use file1.dta;
local average_income = mean income;
use New.dta;
gen avincome = average_income;
However, it does not work.
One way to do this would be the following:
#delimit;
clear;
set mem 700m;
use file1.dta;
quietly: summarize income;
local average_income = r(mean);
use New.dta;
gen avincome = `average_income';
This overlaps with your other post, namely How to retrieve data from multiple Stata files?. You don't say why you think
use file1.dta;
local average_income = mean income;
will work, but the second line is just fantasy syntax. There are various ways to calculate the mean of a variable, the most common being to use summarize and pick up the mean from r(mean).
You should probably delete this question: it serves no long-term purpose.