Stata: Removing Non-unique duplicates

I want to retain one copy of each company-year observation, taking my subyear_total variable into account.
Some company-years have multiple entries, as recorded in the variable copies.
copies was created by:
bysort cik year: gen copies = _N
How can I remove the duplicates but keep one copy of each unique observation?
* Example generated by -dataex-. To install: ssc install dataex
clear
input int year long cik float(subyear_total copies)
1999 1750 425000 1
2005 1750 4232000 1
2006 1750 1.60e+07 1
2007 1750 182444 3
2007 1750 182444 3
2007 1750 182444 3
2008 1750 710909 3
2008 1750 710909 3
2008 1750 710909 3
2009 1750 5155390 5
2009 1750 5155390 5
2009 1750 5155390 5
2009 1750 5155390 5
2009 1750 5155390 5
end
So for example:
2007 has 3 entries and I want to keep one of them and drop the rest. The same goes for 2008 and 2009 (which has 5 entries).
If I do drop if copies > 1, would I lose all instances of those years? How can I keep at least one?

The duplicates command could be used here, but in your case
bysort year cik : keep if _n == 1
gets you there directly. The variable copies is then of no obvious use.

You want to use _n instead of _N in your code to assign groupwise ids, like:
bysort cik year: gen copies = _n
Then drop observations with copies greater than one:
drop if copies > 1
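For completeness, here is a minimal sketch of the duplicates route mentioned in the first answer, run on a shortened version of the example data. It assumes the surplus observations are identical on every variable, as they are in the question. With no varlist, duplicates drop only removes exact copies, and isid then checks that cik and year really do identify the remaining observations.

```stata
clear
input int year long cik float subyear_total
1999 1750 425000
2007 1750 182444
2007 1750 182444
2007 1750 182444
2008 1750 710909
2008 1750 710909
2008 1750 710909
end

* Show how many surplus copies exist before dropping anything.
duplicates report

* Drop observations that are identical on every variable,
* keeping the first occurrence of each.
duplicates drop

* Exits with an error if any cik-year pair still occurs more than once.
isid cik year
list, sepby(cik)
```

If isid fails, some company-year has entries that differ in subyear_total, and you would need to decide which one to keep rather than dropping blindly.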

Related

Inactivity duration variable in panel data (Stata)

I have a dataset for U.S. manufacturing workers over the past three decades, and I am particularly interested in the following variables:
Month and year of 1st manufacturing job, recorded separately and named "start_month_job_1" & "start_yr_job_1."
Month and year of leaving the 1st manufacturing job, recorded separately and named "end_month_job_1" & "end_yr_job_1."
The reason for leaving the job (e.g. retirement, firing, factory shutdown, etc.), named "leaving_reason"
Month and year of 2nd manufacturing job, recorded separately and named "start_month_job_2" & "start_yr_job_2."
Month and year of leaving the 2nd manufacturing job, recorded separately and named "end_month_job_2" & "end_yr_job_2."
I am trying to create a variable that measures the duration of economic inactivity/idleness. I define "duration of economic inactivity" as the time between leaving the 1st job and starting another job. I have created a variable that does this in years, as below:
gen econ_inactivity_duration_1 = start_yr_job_2 - end_yr_job_1
* Where a worker has not started a second job by 2018, the latest year measured in the survey:
replace econ_inactivity_duration_1 = 2018 - end_yr_job_1 if missing(start_yr_job_2)
However, I want to actually create an economic_inactivity_duration variable that takes into account the difference in month and year, for both starting and leaving a job, respectively. For instance, the duration for the worker in row 1 would be 2 months, between May, 1993 and July, 1993, as opposed to zero, which is what my code above computes.
* Example generated by -dataex-.
clear
input byte start_month_job_1 int start_yr_job_1 byte end_month_job_1 int end_yr_job_1 byte start_month_job_2 int start_yr_job_2 byte end_month_job_2 int end_yr_job_2 str14 leaving_reason
3 1990 5 1993 7 1993 4 1994 "Firm shutdown"
1 2003 7 2015 . . . . "job automation"
98 1979 98 2004 . . . . "Firm shutdown"
98 1975 98 2010 98 2010 98 2015 "job automation"
1 1983 12 1985 1 1986 . . "Firm shutdown"
98 1996 98 1998 . . . . "Firm shutdown"
end

There is probably a better way, but here is a crude method.
* Data example
input end_month_job_1 end_yr_job_1 start_month_job_2 start_yr_job_2
5 1993 7 1993
end
* Calculate months since 1960
gen j1_end = (end_yr_job_1 - 1960) * 12 + end_month_job_1
gen j2_start = (start_yr_job_2 - 1960) * 12 + start_month_job_2
* Calculate difference
gen wanted = j2_start - j1_end
* Check difference is positive
assert wanted > 0
list
     +------------------------------------------------------------------------+
     | end_mo~1   end_yr~1   s~mont~2   s~yr_j~2   j1_end   j2_start   wanted |
     |------------------------------------------------------------------------|
  1. |        5       1993          7       1993      401        403        2 |
     +------------------------------------------------------------------------+
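A possibly tidier variant of the same idea uses Stata's built-in ym() function, which returns a monthly date (months since January 1960), so the subtraction gives the gap in months directly. This is only a sketch. It assumes that month code 98 in the question's data means "month unknown", which the question does not state, so those cases are set to missing and the resulting duration is missing too.

```stata
clear
input byte end_month_job_1 int end_yr_job_1 byte start_month_job_2 int start_yr_job_2
5 1993 7 1993
12 1985 1 1986
98 2010 98 2010
end

* Assumption: 98 is a code for an unknown month, so treat it as missing.
mvdecode end_month_job_1 start_month_job_2, mv(98)

* ym(year, month) builds a monthly date; it is missing if either part is missing.
gen j1_end   = ym(end_yr_job_1, end_month_job_1)
gen j2_start = ym(start_yr_job_2, start_month_job_2)
format j1_end j2_start %tm

* Gap in months between leaving job 1 and starting job 2.
gen wanted = j2_start - j1_end
list
```

The monthly dates differ by one from the hand calculation above (ym() counts January 1960 as 0), but the difference between two dates is the same either way.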

Keep individuals in the same firm by year (Stata)

I have an employer-employee database and need to keep only the individuals that have at least one colleague considering the Firm_id variable, but I don't know how to do this in Stata. My dataset is like this:
Id Firm_id Year
1 50 2010
1 50 2011
2 50 2010
2 50 2011
3 22 2010
3 22 2011
4 22 2010
4 20 2011
In the case above, I would keep only the individuals corresponding to the Id 1 and 2 because they are in the same firm in both of the years in the sample and Id 3 and 4 for 2010.
The output I'm looking for is like:
Id Firm_id Year
1 50 2010
1 50 2011
2 50 2010
2 50 2011
3 22 2010
4 22 2010
Any suggestions on how to perform this in Stata?
Regards,

bysort Id (Firm_id) : keep if Firm_id[1] == Firm_id[_N]
See FAQ here.
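Note that the one-liner above keeps individuals who never change firm, which is not quite the output shown in the question (it would keep Id 3 in 2011 and drop Id 4 in 2010). If the goal is "at least one colleague in the same firm in the same year", one possible sketch is to count observations per firm-year instead. It assumes each Id appears at most once per firm and year.

```stata
clear
input Id Firm_id Year
1 50 2010
1 50 2011
2 50 2010
2 50 2011
3 22 2010
3 22 2011
4 22 2010
4 20 2011
end

* Keep a firm-year only if more than one individual is observed in it.
bysort Firm_id Year: keep if _N > 1

sort Id Year
list, sepby(Id)
```

On the example data this leaves the six observations listed as the desired output in the question.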

Calculating the amount of Board Turnover

I have been trying to calculate the amount of turnover happening in executive boards between 2006 and 2009 in the financial sector.
For this I have data looking like the following:
Year Bank Director DirectorID (ISIN, RoA, Size etc)
2005 Bank1 John Smith 120
2005 Bank1 Barry Pooter 160
2005 Bank1 Jack Sparrow 2070
2006 Bank1 John Smith 120
2006 Bank1 Barry Pooter 160
2006 Bank1 Jack Sparrow 2070
2007 Bank1 John Smith 120
2007 Bank1 Barry Pooter 160
2007 Bank1 Jack Sparrow 2070
2008 Bank1 John Smith 120
2008 Bank1 Carla Jansen 250
2008 Bank1 Jack Sparrow 2070
2009 Bank1 John Smith 160
2009 Bank1 Carla Jansen 250
2009 Bank1 Mike Stata 875
And this data repeats for each bank from 2005 - 2015.
Now I have already made a turnover dummy variable with 0 = no change and 1 = change by using:
collapse (sum) DirectorID, by(ISIN Year Bank)
gen interest = inrange(Year, 2006,2009)
bysort ID interest (DirectorID) : gen temp = DirectorID[1] != DirectorID[_N]
replace temp = . if interest==0
bysort ID : egen changed = max(temp)
However, I would like turnover to be an actual count of how many changes were made. For example, assume Bank2 made no change (Turnover = 0), Bank3 made 6 changes, i.e. 6 new managers came in (Turnover = 6), and Bank4 made 4 changes, i.e. 4 new managers came in (Turnover = 4):
Bank Turnover (ISIN, RoA, Size, etc)
Bank1 2
Bank2 0
Bank3 6
Bank4 4
Is this possible with Stata (or SPSS if that happens to be the case)?
ISIN codes are my ID variable as they are linked to each specific bank.
Two new people entered the board of Bank1. For now it would show as Turnover = 2 as only 2 new people entered the organization's board. Had three people joined in the previous example, in that case Turnover = 3 as each change made to the Board counts as "+1" turnover regardless of the people leaving. Only people that join (whether they replace someone or are just an addition to the board) are of interest in my thesis.
However, this could also be calculated differently if that makes it easier. Depends on how I write my methodology. It would be fine if the variable turnover says how many changes were made per year, i.e. Turnover2005: 2005-2006, Turnover2006: 2006-2007, Turnover2007: 2007-2008 and Turnover2008: 2008-2009.
Finally, it's possible that TMTs grow, i.e. 2005 bank 1 has 14 managers on the board and in 2006 they hire 3 new managers but only let 1 go. Now the board has 16 managers and made 3 changes (3 new managers)

This might help. The following code builds a dataset consisting of data with four banks and five years. It is panel data. The xtset command lets you use time series operators which are well documented here (https://www.youtube.com/watch?v=ik8r4WvrPkc). (Note: for sake of clear exposition, in this example Bank 1 had no changes, Bank 2 had two changes, Bank 3 had three, etc.).
// Clear the session and other memory.
set more off
clear all
// Input reproducible data.
input year bank_num ceo_num
2005 1 200
2006 1 200
2007 1 200
2008 1 200
2009 1 200
2005 2 222
2006 2 222
2007 2 222
2008 2 333
2009 2 444
2005 3 300
2006 3 301
2007 3 302
2008 3 302
2009 3 303
2005 4 999
2006 4 888
2007 4 777
2008 4 666
2009 4 555
end
// Declare the panel structure.
xtset bank_num year
// Gen variable indicating if ceo_num stayed same.
// Resulting variable is 0 when there was no change.
gen no_turn = (ceo_num - f1.ceo_num)
// Gen dummy to indicate if ceo_num changed.
gen is_turn = (no_turn != 0 & no_turn < .)
// Gen a variable that counts changes.
egen turn_nums = total(is_turn), by(bank_num)
// List data to inspect results.
list
Edit: Re-characterized comment for no_turn variable.
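The example above tracks one person per bank. For director-level data like that in the question, where only people joining the board count, a rough sketch is to flag each director's first appearance on a bank's board and ignore the bank's first observed year (nobody can be "new" relative to an unobserved year). The variable names bank, director_id, joined and turnover are made up for illustration, the data are the 2007 to 2009 rows for Bank1, and John Smith's 2009 ID is taken to be 120 as in the other years.

```stata
clear
* Illustrative data; the 2009 ID for John Smith is assumed to be 120.
input int year str5 bank int director_id
2007 "Bank1" 120
2007 "Bank1" 160
2007 "Bank1" 2070
2008 "Bank1" 120
2008 "Bank1" 250
2008 "Bank1" 2070
2009 "Bank1" 120
2009 "Bank1" 250
2009 "Bank1" 875
end

* Flag the first year in which each director appears on a given board.
bysort bank director_id (year): gen byte joined = (_n == 1)

* Directors present in the bank's first observed year are not newcomers.
bysort bank (year): replace joined = 0 if year == year[1]

* Total number of joiners per bank over the whole period.
egen turnover = total(joined), by(bank)
list, sepby(year)
```

This gives turnover = 2 for Bank1 (Carla Jansen and Mike Stata). A director who leaves and later returns is counted only once here. For a per-year count, total joined by bank and year instead.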

Creating a dummy indicating the last row of each group

I have the following panel dataset.
I did
sort FirmID Year
to make the following.
FirmID Year
1 1996
1 1997
1 1998
2 2000
2 2001
I want to create a new variable exitnextyear which is 1 if the firm does not exist next year, so that the output is
FirmID Year exitnextyear
1 1996 0
1 1997 0
1 1998 1
2 2000 0
2 2001 1
I think I have to use something like
by FirmID: gen exitnextyear (and something)
but I don't know what to do next.

clear
input FirmID Year
1 1996
1 1997
1 1998
2 2000
2 2001
end
bysort FirmID (Year) : gen byte exitnextyear = _n == _N
list, sepby(FirmID)
For the principles, see help and manual entries on by: and/or a tutorial review accessible here.
Row is spreadsheetspeak; in Stata the term is observation.

Adding across years

Quick question. I'm working with code that produces a spreadsheet containing information like the following:
year business sales profit
2001 a 5 3
2002 a 6 4
2003 a 4 2
2001 b 2 1
2002 b 6 3
2003 b 7 5
How can I get Stata to total sales and profits across years?
Thanks

Try
collapse (sum) sales profit, by(year)
or, if you want to retain your original data,
bysort year: egen tot_sales = total(sales)
egen stands for extended generate, a very useful command.
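One caveat: by(year) gives one total per year, summed over businesses. If "across years" means one total per business, summed over years, the grouping variable would be business instead. A small sketch on the data from the question, where tot_sales and tot_profit are just illustrative names:

```stata
clear
input int year str1 business sales profit
2001 a 5 3
2002 a 6 4
2003 a 4 2
2001 b 2 1
2002 b 6 3
2003 b 7 5
end

* Keep the original rows and add each business's totals over all years.
bysort business: egen tot_sales  = total(sales)
bysort business: egen tot_profit = total(profit)
list, sepby(business)
```

In this particular example both businesses happen to total 15 in sales and 9 in profit. To reduce the data to one row per business, collapse (sum) sales profit, by(business) would be the equivalent.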
