Replacing variable entries to be the same in each group - stata

I'm working with panel data in Stata, and I have a setup like the following:
ID  year  value
1   2010
1   2011  20
1   2012  20
1   2013
1   2014
2   2010
2   2011  14
2   2012  14
2   2013  14
2   2014  14
and I want to change the blank entries to be the same as the other entries within that ID, for any year. I.e., I want something like the following:
ID  year  value
1   2010  20
1   2011  20
1   2012  20
1   2013  20
1   2014  20
2   2010  14
2   2011  14
2   2012  14
2   2013  14
2   2014  14
What do you recommend?

If the non-missing values of value are always the same within id, you can use this:
* Example generated by -dataex-. For more info, type help dataex
clear
input byte id int year byte value
1 2010 .
1 2011 20
1 2012 20
1 2013 .
1 2014 .
2 2010 .
2 2011 14
2 2012 14
2 2013 14
2 2014 14
end
*Get mean of values within id
bysort id : egen value2 = mean(value)
*Transfer values back to original var to maintain var labels etc. then drop value2
replace value = value2
drop value2
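
The same fill can be sketched outside Stata. The plain-Python version below is only an illustration, assuming (as the answer does) that all non-missing values within an id are identical. The function name fill_within_group and the use of None for missing values are choices made for this example, not part of the original answer.

```python
def fill_within_group(rows):
    """Fill missing values (None) with the single non-missing value of the same id.

    rows: list of (id, year, value) tuples.
    Raises ValueError if an id carries more than one distinct non-missing value.
    """
    known = {}
    for id_, _year, value in rows:
        if value is None:
            continue
        if id_ in known and known[id_] != value:
            raise ValueError(f"id {id_} has conflicting values")
        known[id_] = value
    # ids without any non-missing value stay missing
    return [(id_, year, known.get(id_)) for id_, year, _value in rows]


rows = [
    (1, 2010, None), (1, 2011, 20), (1, 2012, 20), (1, 2013, None), (1, 2014, None),
    (2, 2010, None), (2, 2011, 14), (2, 2012, 14), (2, 2013, 14), (2, 2014, 14),
]
filled = fill_within_group(rows)
```

Unlike mean(), which would quietly average differing values within an id, this sketch stops with an error when the assumption does not hold.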

Related

How to complete (fill) a panel dataset with a time variable that has a delta > 1?

Say I have a biennial panel with observations only in odd years, such as
input id year var
1 2011 23
1 2013 12
1 2015 11
2 2011 44
2 2013 42
2 2015 13
end
and I would like to fill in the missing even years. Here, years 2012 and 2014 are missing for all ids.
input id year var
1 2011 23
1 2012 .
1 2013 12
1 2014 .
1 2015 11
2 2011 44
2 2012 .
2 2013 42
2 2014 .
2 2015 13
end
I had a look at help expand but I am unsure that's what I need, since it does not take the by prefix.
As background, I need to fill in the even years to be able to merge with another panel dataset that was conducted in even years only.

You can declare id as the panel variable and year as the time variable with xtset, then use tsfill:
clear
input id year var
1 2011 23
1 2013 12
1 2015 11
2 2011 44
2 2013 42
2 2015 13
end
xtset id year
tsfill
If the minimum and maximum years differ across panels, you could look at the full option (tsfill, full), which fills every panel over the overall range of years.
. list
     +-----------------+
     | id   year   var |
     |-----------------|
  1. |  1   2011    23 |
  2. |  1   2012     . |
  3. |  1   2013    12 |
  4. |  1   2014     . |
  5. |  1   2015    11 |
     |-----------------|
  6. |  2   2011    44 |
  7. |  2   2012     . |
  8. |  2   2013    42 |
  9. |  2   2014     . |
 10. |  2   2015    13 |
     +-----------------+
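
For comparison, the gap-filling that tsfill performs can be sketched in plain Python. This is an illustration only, not how Stata implements it. The function name fill_years is hypothetical, missing values are represented by None, and each id is filled between its own first and last observed year (the behaviour without the full option).

```python
def fill_years(rows):
    """Add a row with a missing value (None) for every absent year, per id.

    rows: list of (id, year, var) tuples.
    Each id is filled between its own first and last observed year.
    """
    by_id = {}
    for id_, year, var in rows:
        by_id.setdefault(id_, {})[year] = var

    out = []
    for id_ in sorted(by_id):
        observed = by_id[id_]
        for year in range(min(observed), max(observed) + 1):
            # years that were not observed get None
            out.append((id_, year, observed.get(year)))
    return out


panel = [
    (1, 2011, 23), (1, 2013, 12), (1, 2015, 11),
    (2, 2011, 44), (2, 2013, 42), (2, 2015, 13),
]
filled_panel = fill_years(panel)
```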

Keep individuals in the same firm by year (Stata)

I have an employer-employee database and need to keep only the individuals that have at least one colleague considering the Firm_id variable, but I don't know how to do this in Stata. My dataset is like this:
Id Firm_id Year
1 50 2010
1 50 2011
2 50 2010
2 50 2011
3 22 2010
3 22 2011
4 22 2010
4 20 2011
In the case above, I would keep Ids 1 and 2 in both years, because they share a firm in both years of the sample, and Ids 3 and 4 in 2010 only.
The output I'm looking for is like:
Id Firm_id Year
1 50 2010
1 50 2011
2 50 2010
2 50 2011
3 22 2010
4 22 2010
Any suggestions on how to perform this in Stata?
Regards,

bysort Year Firm_id (Id) : keep if Id[1] != Id[_N]
See FAQ here.

Filter specific observations

I have an employer-employee database and need to keep only the individuals that have at least one colleague considering the Firm_id variable, but I don't know how to do this in Stata. My dataset is like:
Id Firm_id Year
1 50 2010
1 50 2011
2 50 2010
2 50 2011
3 22 2010
3 22 2011
4 22 2010
4 20 2011
In the case above, I would keep Ids 1 and 2 because they are in the same firm in both years of the sample. The 2011 observations of individuals 3 and 4 would be dropped.
The output I'm looking for is like:
Id Firm_id Year
1 50 2010
1 50 2011
2 50 2010
2 50 2011
3 22 2010
4 22 2010

This works for your data example:
clear
input Id Firm_id Year
1 50 2010
1 50 2011
2 50 2010
2 50 2011
3 22 2010
3 22 2011
4 22 2010
4 20 2011
end
bysort Year Firm_id : keep if Id[1] != Id[_N]
sort Id Year
list
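
The idea behind the bysort line is that a (Year, Firm_id) group containing more than one distinct Id is a group in which everyone has a colleague. A plain-Python sketch of the same rule counts the distinct ids per group explicitly. The function name keep_with_colleagues is hypothetical.

```python
from collections import defaultdict


def keep_with_colleagues(rows):
    """Keep rows whose (Year, Firm_id) group contains at least two distinct Ids.

    rows: list of (Id, Firm_id, Year) tuples. The result is sorted by Id, Year.
    """
    ids_in_group = defaultdict(set)
    for id_, firm, year in rows:
        ids_in_group[(year, firm)].add(id_)

    kept = [
        (id_, firm, year)
        for id_, firm, year in rows
        if len(ids_in_group[(year, firm)]) > 1
    ]
    return sorted(kept, key=lambda r: (r[0], r[2]))


employees = [
    (1, 50, 2010), (1, 50, 2011),
    (2, 50, 2010), (2, 50, 2011),
    (3, 22, 2010), (3, 22, 2011),
    (4, 22, 2010), (4, 20, 2011),
]
colleagues = keep_with_colleagues(employees)
```

Counting distinct ids rather than rows means a repeated row for the same person does not count as a colleague.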

About keeping observation with specified criteria in SAS

Hello, and many thanks in advance for your answers and efforts to help newbie users in this forum.
I have a SAS table with the variables ID, Year, Month, and Creation date.
What I want is to keep only one row per ID, Year, Month, and Creation date.
My HAVE data is :
ID Year Month Date of creation
1 2019 1 a
1 2019 1 a
1 2019 1 b
1 2019 2 c
1 2019 3 d
1 2020 5 e
2 2019 1 a
2 2019 1 b
2 2019 3 c
3 2021 8 m
3 2021 9 k
My WANT data is
ID Year Month Date of creation
1 2019 1 a
1 2019 1 b
1 2019 2 c
1 2019 3 d
1 2020 5 e
2 2019 1 a
2 2019 1 b
2 2019 3 c
3 2021 8 m
3 2021 9 k
I tried NODUPKEY, but it removes IDs.

Your example seems to work fine with the NODUPKEY option of PROC SORT. Perhaps you used the wrong BY variables?
data have;
input ID Year Month Creation $ ;
cards;
1 2019 1 a
1 2019 1 a
1 2019 1 b
1 2019 2 c
1 2019 3 d
1 2020 5 e
2 2019 1 a
2 2019 1 b
2 2019 3 c
3 2021 8 m
3 2021 9 k
;
proc sort data=have out=want nodupkey;
by id year month creation ;
run;
You can also use the DISTINCT clause in PROC SQL, which removes rows that are duplicated across all columns:
proc sql;
create table want
as
select distinct * from have;
quit;
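
Both approaches drop rows that repeat on every column. As a rough illustration in plain Python (the name drop_duplicate_rows is hypothetical, not a SAS or library function), keeping the first occurrence of each full row reproduces the WANT table for the example data:

```python
def drop_duplicate_rows(rows):
    """Return rows with exact duplicates removed, keeping first occurrences."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:
            seen.add(row)
            unique.append(row)
    return unique


have = [
    (1, 2019, 1, "a"), (1, 2019, 1, "a"), (1, 2019, 1, "b"),
    (1, 2019, 2, "c"), (1, 2019, 3, "d"), (1, 2020, 5, "e"),
    (2, 2019, 1, "a"), (2, 2019, 1, "b"), (2, 2019, 3, "c"),
    (3, 2021, 8, "m"), (3, 2021, 9, "k"),
]
want = drop_duplicate_rows(have)
```

Unlike PROC SORT, this sketch does not reorder the rows. The example data is already sorted, so the result is in the same order either way.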

python: obtaining a column of dates from the columns of years-months-days

Suppose I have a very simple dataframe:
>>> a
Out[158]:
monthE yearE dayE
0 10 2014 15
1 2 2012 15
2 2 2014 15
3 12 2015 15
4 2 2012 15
Suppose that I want to create the column with the date related to every line, using three columns of integers.
When I have simple numbers it is enough to do like:
>>> datetime.date(1983,11,8)
Out[159]: datetime.date(1983, 11, 8)
If I have to create a column of dates (theoretically a very basic request), instead:
a.apply(lambda x: datetime.date(x['yearE'],x['monthE'],x['dayE']))
I obtain the following error:
KeyError: ('yearE', u'occurred at index monthE')

I think you can first remove the trailing E from the column names and then use to_datetime, but then you get pandas Timestamps, not Python dates:
df.columns = df.columns.str[:-1]
df['date'] = pd.to_datetime(df)
# if df has other columns, select the date parts first
#df['date'] = pd.to_datetime(df[['year','month','day']])
print (df)
month year day date
0 10 2014 15 2014-10-15
1 2 2012 15 2012-02-15
2 2 2014 15 2014-02-15
3 12 2015 15 2015-12-15
4 2 2012 15 2012-02-15
print (df.date.dtypes)
datetime64[ns]
print (df.date.iloc[0])
2014-10-15 00:00:00
print (type(df.date.iloc[0]))
<class 'pandas.tslib.Timestamp'>
Thanks to MaxU for this solution, which keeps the original column names:
df['date'] = pd.to_datetime(df.rename(columns = lambda x: x[:-1]))
# if df has other columns, select the date parts first
#df['date'] = pd.to_datetime(df[['yearE','monthE','dayE']].rename(columns=lambda x: x[:-1]))
print (df)
monthE yearE dayE date
0 10 2014 15 2014-10-15
1 2 2012 15 2012-02-15
2 2 2014 15 2014-02-15
3 12 2015 15 2015-12-15
4 2 2012 15 2012-02-15
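If you really need Python dates, add axis=1 to apply, but then some pandas datetime functions can no longer be used on the column. Without axis=1, apply passes each column to the function, which is why looking up 'yearE' fails with a KeyError:
df['date'] = df.apply(lambda x: datetime.date(x['yearE'], x['monthE'], x['dayE']), axis=1)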
print (df)
monthE yearE dayE date
0 10 2014 15 2014-10-15
1 2 2012 15 2012-02-15
2 2 2014 15 2014-02-15
3 12 2015 15 2015-12-15
4 2 2012 15 2012-02-15
print (df.date.dtypes)
object
print (df.date.iloc[0])
2014-10-15
print (type(df.date.iloc[0]))
<class 'datetime.date'>
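
If plain Python dates are needed but a row-wise apply is unappealing, one option is to build the Timestamps with to_datetime first and convert them afterwards with the .dt.date accessor. A small sketch, assuming the column names from the question and a reasonably recent pandas. The intermediate names parts and stamps are made up for the example.

```python
import datetime

import pandas as pd

df = pd.DataFrame({
    "monthE": [10, 2, 2, 12, 2],
    "yearE": [2014, 2012, 2014, 2015, 2012],
    "dayE": [15, 15, 15, 15, 15],
})

# to_datetime expects columns named year, month and day,
# so strip the trailing E on a copy of the three columns
parts = df[["yearE", "monthE", "dayE"]].rename(columns=lambda c: c[:-1])
stamps = pd.to_datetime(parts)  # Series of pandas Timestamps

# .dt.date turns each Timestamp into a datetime.date
df["date"] = stamps.dt.date
```

The resulting column holds datetime.date objects, so the same caveat applies as with apply: the pandas datetime functionality is no longer available on it.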