Filter specific observations (Stata)

I have an employer-employee database and need to keep only the individuals who have at least one colleague, that is, another individual with the same Firm_id in the same year, but I don't know how to do this in Stata. My dataset looks like this:
Id Firm_id Year
1 50 2010
1 50 2011
2 50 2010
2 50 2011
3 22 2010
3 22 2011
4 22 2010
4 20 2011
In the case above, I would keep individuals 1 and 2 in both years, because they are in the same firm in both years of the sample, and individuals 3 and 4 in 2010 only. Individual 3 in 2011 and individual 4 in 2011 would be dropped.
The output I'm looking for is like:
Id Firm_id Year
1 50 2010
1 50 2011
2 50 2010
2 50 2011
3 22 2010
4 22 2010

This works for your data example:
clear
input Id Firm_id Year
1 50 2010
1 50 2011
2 50 2010
2 50 2011
3 22 2010
3 22 2011
4 22 2010
4 20 2011
end
* Sort by Id within each year-firm group, then keep groups with at least two distinct Ids
bysort Year Firm_id (Id) : keep if Id[1] != Id[_N]
sort Id Year
list
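
For readers who want to check the grouping logic outside Stata, here is a minimal sketch in plain Python. It assumes the rows are held as (Id, Firm_id, Year) tuples; the function name keep_with_colleagues is made up for this illustration and is not part of the answer above.

```python
from collections import defaultdict

# Rows mirror the example data: (Id, Firm_id, Year)
rows = [
    (1, 50, 2010), (1, 50, 2011),
    (2, 50, 2010), (2, 50, 2011),
    (3, 22, 2010), (3, 22, 2011),
    (4, 22, 2010), (4, 20, 2011),
]

def keep_with_colleagues(rows):
    """Keep rows whose (Year, Firm_id) group holds at least two distinct Ids."""
    ids_per_group = defaultdict(set)
    for person_id, firm_id, year in rows:
        ids_per_group[(year, firm_id)].add(person_id)
    kept = [r for r in rows if len(ids_per_group[(r[2], r[1])]) > 1]
    # Same final ordering as "sort Id Year"
    return sorted(kept, key=lambda r: (r[0], r[2]))

result = keep_with_colleagues(rows)
for row in result:
    print(*row)
```

Counting distinct Ids per group, rather than rows, keeps the rule correct even if an Id were repeated within a firm and year.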

Related

Replacing variable entries to be the same in each group

I'm working with panel data in Stata, and I have a set up like the following:
ID year value
1 2010
1 2011 20
1 2012 20
1 2013
1 2014
2 2010
2 2011 14
2 2012 14
2 2013 14
2 2014 14
and I want to change the blank entries to be the same as the other entries within that ID, for any year. I.e., I want something like the following:
ID year value
1 2010 20
1 2011 20
1 2012 20
1 2013 20
1 2014 20
2 2010 14
2 2011 14
2 2012 14
2 2013 14
2 2014 14
What do you recommend?

If the non-missing values of the variable value are always the same within id, you can use this:
* Example generated by -dataex-. For more info, type help dataex
clear
input byte id int year byte value
1 2010 .
1 2011 20
1 2012 20
1 2013 .
1 2014 .
2 2010 .
2 2011 14
2 2012 14
2 2013 14
2 2014 14
end
*Get mean of values within id
bysort id : egen value2 = mean(value)
*Transfer values back to original var to maintain var labels etc. then drop value2
replace value = value2
drop value2
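
The same idea can be sketched in plain Python, assuming the data are (id, year, value) tuples with None standing in for a missing value. The function name fill_with_group_mean is hypothetical, and, as in the Stata code, the result is only meaningful when the observed values agree within each id.

```python
from collections import defaultdict
from statistics import mean

# (id, year, value); None marks a missing value
rows = [
    (1, 2010, None), (1, 2011, 20), (1, 2012, 20), (1, 2013, None), (1, 2014, None),
    (2, 2010, None), (2, 2011, 14), (2, 2012, 14), (2, 2013, 14), (2, 2014, 14),
]

def fill_with_group_mean(rows):
    """Replace every value with the mean of the non-missing values in its id group."""
    observed = defaultdict(list)
    for group_id, _year, value in rows:
        if value is not None:
            observed[group_id].append(value)
    # Groups with no observed value at all stay missing (None)
    group_mean = {g: mean(vals) for g, vals in observed.items()}
    return [(g, year, group_mean.get(g)) for g, year, _value in rows]

filled = fill_with_group_mean(rows)
for row in filled:
    print(*row)
```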

Keep individuals in the same firm by year (Stata)

I have an employer-employee database and need to keep only the individuals who have at least one colleague, that is, another individual with the same Firm_id in the same year, but I don't know how to do this in Stata. My dataset looks like this:
Id Firm_id Year
1 50 2010
1 50 2011
2 50 2010
2 50 2011
3 22 2010
3 22 2011
4 22 2010
4 20 2011
In the case above, I would keep Id 1 and 2 in both years, because they are in the same firm in both years of the sample, and Id 3 and 4 for 2010 only.
The output I'm looking for is like:
Id Firm_id Year
1 50 2010
1 50 2011
2 50 2010
2 50 2011
3 22 2010
4 22 2010
Any suggestions on how to perform this in Stata?

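This keeps the individuals whose firm never changes:
bysort Id (Firm_id) : keep if Firm_id[1] == Firm_id[_N]
See FAQ here.
Note that on the example data this retains Id 3 in 2011 and drops Id 4 entirely, which differs from the requested output. To keep observations that share a firm and year with at least one other individual, group by year and firm instead (this assumes each Id appears at most once per firm and year):
bysort Year Firm_id : keep if _N > 1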

About keeping observation with specified criteria in SAS

Hello, and many thanks in advance for your answers and efforts to help newbie users in this forum.
I have a SAS table with the variables ID, Year, Month, and Creation date.
What I want is to keep only one row per ID, year, month, and creation date.
My HAVE data is :
ID Year Month Date of creation
1 2019 1 a
1 2019 1 a
1 2019 1 b
1 2019 2 c
1 2019 3 d
1 2020 5 e
2 2019 1 a
2 2019 1 b
2 2019 3 c
3 2021 8 m
3 2021 9 k
My WANT data is
ID Year Month Date of creation
1 2019 1 a
1 2019 1 b
1 2019 2 c
1 2019 3 d
1 2020 5 e
2 2019 1 a
2 2019 1 b
2 2019 3 c
3 2021 8 m
3 2021 9 k
I tried NODUPKEY, but it removes IDs.

Your example seems to work fine with NODUPKEY option of PROC SORT. Perhaps you used the wrong BY variables?
data have;
input ID Year Month Creation $ ;
cards;
1 2019 1 a
1 2019 1 a
1 2019 1 b
1 2019 2 c
1 2019 3 d
1 2020 5 e
2 2019 1 a
2 2019 1 b
2 2019 3 c
3 2021 8 m
3 2021 9 k
;
proc sort data=have out=want nodupkey;
by id year month creation ;
run;

You can also use the DISTINCT clause in PROC SQL, which removes duplicates based on all columns:
proc sql;
create table want
as
select distinct * from have;
quit;
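
To see why the choice of BY variables matters, here is a rough Python sketch of "sort, then keep the first row per key". It is only an illustration of the idea, not SAS behavior in every detail, and the function name sort_nodupkey is made up for this example.

```python
# Rows mirror the HAVE data: (ID, Year, Month, Creation)
have = [
    (1, 2019, 1, "a"), (1, 2019, 1, "a"), (1, 2019, 1, "b"),
    (1, 2019, 2, "c"), (1, 2019, 3, "d"), (1, 2020, 5, "e"),
    (2, 2019, 1, "a"), (2, 2019, 1, "b"), (2, 2019, 3, "c"),
    (3, 2021, 8, "m"), (3, 2021, 9, "k"),
]

def sort_nodupkey(rows, key):
    """Sort rows by key and keep only the first row seen for each key value."""
    seen = set()
    kept = []
    for row in sorted(rows, key=key):
        k = key(row)
        if k not in seen:
            seen.add(k)
            kept.append(row)
    return kept

# Key on all four variables: only the exact duplicate row is dropped
want = sort_nodupkey(have, key=lambda r: (r[0], r[1], r[2], r[3]))

# Key on ID alone: one row per ID survives, which loses most of the data
by_id_only = sort_nodupkey(have, key=lambda r: r[0])

print(len(want), len(by_id_only))
```

With the full key, 10 of the 11 rows remain; with ID as the only key, just 3 rows remain, which may be what happened in the question.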

Extract weeks from datetime (Python Pandas)

I have a dataframe:
time year month
0 12/28/2013 0:17 2013 12
1 12/28/2013 0:20 2013 12
2 12/28/2013 0:26 2013 12
3 12/29/2013 0:20 2013 12
4 12/29/2013 0:26 2013 12
5 12/30/2013 0:31 2013 12
6 12/30/2013 0:31 2013 12
7 12/31/2013 0:17 2013 12
8 12/31/2013 0:20 2013 12
9 12/31/2013 0:26 2013 12
10 1/1/2014 4:30 2014 1
11 1/1/2014 4:34 2014 1
12 1/1/2014 4:37 2014 1
13 1/2/2014 4:30 2014 1
14 1/2/2014 5:30 2014 1
15 1/3/2014 4:30 2014 1
16 1/3/2014 4:34 2014 1
17 1/3/2014 4:37 2014 1
18 1/4/2014 4:30 2014 1
19 1/4/2014 4:34 2014 1
20 1/4/2014 4:37 2014 1
I use the following code to extract the week information:
df['week'] = df['time'].dt.week
This makes the dataframe as following:
time year month week
0 2013-12-28 00:17:00 2013 12 52
1 2013-12-28 00:20:00 2013 12 52
2 2013-12-28 00:26:00 2013 12 52
3 2013-12-29 00:20:00 2013 12 52
4 2013-12-29 00:26:00 2013 12 52
5 2013-12-30 00:31:00 2013 12 1
6 2013-12-30 00:31:00 2013 12 1
7 2013-12-31 00:17:00 2013 12 1
8 2013-12-31 00:20:00 2013 12 1
9 2013-12-31 00:26:00 2013 12 1
10 2014-01-01 04:30:00 2014 1 1
11 2014-01-01 04:34:00 2014 1 1
12 2014-01-01 04:37:00 2014 1 1
13 2014-01-02 04:30:00 2014 1 1
14 2014-01-02 05:30:00 2014 1 1
15 2014-01-03 04:30:00 2014 1 1
16 2014-01-03 04:34:00 2014 1 1
17 2014-01-03 04:37:00 2014 1 1
18 2014-01-04 04:30:00 2014 1 1
19 2014-01-04 04:34:00 2014 1 1
20 2014-01-04 04:37:00 2014 1 1
I would like to create another column showing year-week (e.g., 2013-52, 2014-1). The problem is that when I combine the two columns (year, week), rows 5 through 9 give 2013-1, meaning the first week of 2013, which is not correct. Is there any solution for this issue?

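Use dt.strftime (reference: http://strftime.org/). Note that %W numbers weeks from the first Monday of the year, with any earlier days in week 00, so it differs from the ISO week returned by dt.week: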
df.time.dt.strftime('%Y-%W')
0 2013-51
1 2013-51
2 2013-51
3 2013-51
4 2013-51
5 2013-52
6 2013-52
7 2013-52
8 2013-52
9 2013-52
10 2014-00
11 2014-00
12 2014-00
13 2014-00
14 2014-00
15 2014-00
16 2014-00
17 2014-00
18 2014-00
19 2014-00
20 2014-00
Name: time, dtype: object

As @TrigonaMinima pointed out, dt.week follows ISO 8601, which defines the first week of the year as follows:
It is the first week with a majority (4 or more) of its days in January
In your case, week 1 has 2 days in December (the 30th and 31st) and the rest in January, so it fits the definition of the first week of 2014.
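
One way to get a consistent label is to pair the ISO week with the ISO year rather than the calendar year. The sketch below uses only the standard library's datetime.isocalendar(); the function name iso_year_week is made up for this illustration. Recent pandas versions expose the same fields through Series.dt.isocalendar(), though that is not shown here.

```python
from datetime import datetime

def iso_year_week(ts):
    """Return 'ISOyear-ISOweek' so that late-December days in week 1 get the next year."""
    iso_year, iso_week, _weekday = ts.isocalendar()
    return f"{iso_year}-{iso_week}"

samples = [
    datetime(2013, 12, 28, 0, 17),  # Saturday, still in ISO week 52 of 2013
    datetime(2013, 12, 30, 0, 31),  # Monday, already in ISO week 1 of 2014
    datetime(2014, 1, 4, 4, 37),    # Saturday of that same ISO week
]
labels = [iso_year_week(ts) for ts in samples]
print(labels)  # ['2013-52', '2014-1', '2014-1']
```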

python: obtaining a column of dates from the columns of years-months-days

Suppose I have a very simple dataframe:
>>> a
Out[158]:
monthE yearE dayE
0 10 2014 15
1 2 2012 15
2 2 2014 15
3 12 2015 15
4 2 2012 15
Suppose that I want to create a column holding the date for each row, built from the three integer columns.
With plain numbers it is enough to write:
>>> datetime.date(1983,11,8)
Out[159]: datetime.date(1983, 11, 8)
If I have to create a column of dates (theoretically a very basic request), instead:
a.apply(lambda x: datetime.date(x['yearE'],x['monthE'],x['dayE']))
I obtain the following error:
KeyError: ('yearE', u'occurred at index monthE')

I think you can first strip the trailing E from the column names and then use to_datetime, but note that you then get pandas Timestamps, not Python dates:
df.columns = df.columns.str[:-1]
df['date'] = pd.to_datetime(df)
# if df has other columns, select the date columns first
#df['date'] = pd.to_datetime(df[['year','month','day']])
print (df)
month year day date
0 10 2014 15 2014-10-15
1 2 2012 15 2012-02-15
2 2 2014 15 2014-02-15
3 12 2015 15 2015-12-15
4 2 2012 15 2012-02-15
print (df.date.dtypes)
datetime64[ns]
print (df.date.iloc[0])
2014-10-15 00:00:00
print (type(df.date.iloc[0]))
<class 'pandas.tslib.Timestamp'>
Thanks to MaxU for this solution, which keeps the original column names:
df['date'] = pd.to_datetime(df.rename(columns = lambda x: x[:-1]))
# if df has other columns
#df['date'] = pd.to_datetime(df[['yearE','monthE','dayE']].rename(columns=lambda x: x[:-1]))
print (df)
monthE yearE dayE date
0 10 2014 15 2014-10-15
1 2 2012 15 2012-02-15
2 2 2014 15 2014-02-15
3 12 2015 15 2015-12-15
4 2 2012 15 2012-02-15
But if you really need Python dates, add axis=1 to apply. Note that the resulting column has object dtype, so some pandas datetime functions cannot be used on it:
df['date'] = df.apply(lambda x: datetime.date(x['yearE'], x['monthE'], x['dayE']), axis=1)
print (df)
monthE yearE dayE date
0 10 2014 15 2014-10-15
1 2 2012 15 2012-02-15
2 2 2014 15 2014-02-15
3 12 2015 15 2015-12-15
4 2 2012 15 2012-02-15
print (df.date.dtypes)
object
print (df.date.iloc[0])
2014-10-15
print (type(df.date.iloc[0]))
<class 'datetime.date'>
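
For reference, the row-wise call that apply(..., axis=1) performs can be sketched in plain Python, assuming each row is a dict keyed by the original column names. Without axis=1, apply hands the function one column at a time, so a lookup such as x['yearE'] has no matching label and raises the KeyError shown in the question.

```python
import datetime

# Each dict stands in for one DataFrame row
rows = [
    {"monthE": 10, "yearE": 2014, "dayE": 15},
    {"monthE": 2, "yearE": 2012, "dayE": 15},
    {"monthE": 2, "yearE": 2014, "dayE": 15},
    {"monthE": 12, "yearE": 2015, "dayE": 15},
    {"monthE": 2, "yearE": 2012, "dayE": 15},
]

# Build one datetime.date per row from its year, month, and day fields
dates = [datetime.date(r["yearE"], r["monthE"], r["dayE"]) for r in rows]

for d in dates:
    print(d)
```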
