Quick question. I'm working with code that produces a spreadsheet containing information like the following:
year business sales profit
2001 a 5 3
2002 a 6 4
2003 a 4 2
2001 b 2 1
2002 b 6 3
2003 b 7 5
How can I get Stata to total sales and profits across years?
Thanks
Try
collapse (sum) sales profit, by(year)
or, if you want to retain your original data,
bysort year: egen tot_sales = total(sales)
bysort year: egen tot_profit = total(profit)
egen stands for extensions to generate; it is a very useful command.
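Both commands above total within each year, summing over businesses. The question could also be read as asking for one total per business, summed over years. A minimal sketch of that second reading, using the example data from the question; the names bus_sales and bus_profit are made up for illustration:

```stata
clear
input year str1 business sales profit
2001 a 5 3
2002 a 6 4
2003 a 4 2
2001 b 2 1
2002 b 6 3
2003 b 7 5
end

* keep the original observations and attach each business's totals over all years
bysort business: egen bus_sales  = total(sales)
bysort business: egen bus_profit = total(profit)
list, sepby(business)

* or reduce the data to one observation per business
collapse (sum) sales profit, by(business)
list
```

Only the by() grouping changes between the two readings; the choice between collapse and egen is still about whether the original observations should survive.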
I am working with a data set covering multiple countries, variables, and years. It is currently organized wide like so (actually ~30 years and 5 different variables for each country):
country measure yr1995 yr1996 yr1997
USA A 5 4 1
USA B 1 2 1
USA C 0 4 2
UK A 2 4 9
UK B 2 8 4
UK C 2 4 1
What I would like is for the data to be rearranged long like so:
country year A B C
USA 1995 5 1 0
USA 1996 4 2 4
USA 1997 1 1 2
UK 1995 2 2 2
UK 1996 4 8 4
UK 1997 9 4 1
I tried using reshape long yr, i(country) j(year) but get the following error message:
variable id does not uniquely identify the observations
Your data are currently wide. You are performing a reshape long. You specified i(country) and j(year). In
the current wide form, variable country should uniquely identify the observations.
I think this is because country is not the only long variable? (measure also is?)
Besides fixing that issue and arranging the years long instead of wide, I don't think this command will accomplish the other task of moving the different variables (A, B, C) into the wide format as column headers.
Will I need to use a separate reshape wide command for that? Or is there some way to expand the command to do both at once?
It's a double reshape. The years need to be long, not wide, and the measures need to be wide, not long, so the problem has flavours of both.
Economic development data often arrive like this. Indeed, the problem has given rise to at least one dedicated short paper in the Stata Journal, which is visible to all.
Your data example is helpful, and almost immediately useful, but please read the Stata tag and help dataex (if necessary, install dataex first using ssc install dataex).
See also this FAQ, which includes some hints beyond the Stata help and manual entry.
A search reshape in Stata would have pointed to these resources.
clear
input str3 country str1 measure yr1995 yr1996 yr1997
USA A 5 4 1
USA B 1 2 1
USA C 0 4 2
UK A 2 4 9
UK B 2 8 4
UK C 2 4 1
end
reshape long yr, i(country measure) j(year)
reshape wide yr, i(country year) j(measure) string
rename (yr*) *
list, sepby(country)
     +----------------------------+
     | country   year   A   B   C |
     |----------------------------|
  1. |      UK   1995   2   2   2 |
  2. |      UK   1996   4   8   4 |
  3. |      UK   1997   9   4   1 |
     |----------------------------|
  4. |     USA   1995   5   1   0 |
  5. |     USA   1996   4   2   4 |
  6. |     USA   1997   1   1   2 |
     +----------------------------+
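The error in the question comes from i(country): in the wide layout each country appears once per measure, so country alone cannot identify an observation. A quick way to check a candidate i() before reshaping, sketched on the same example data:

```stata
clear
input str3 country str1 measure yr1995 yr1996 yr1997
USA A 5 4 1
USA B 1 2 1
USA C 0 4 2
UK A 2 4 9
UK B 2 8 4
UK C 2 4 1
end

* country alone repeats, once per measure, so it cannot serve as i()
duplicates report country

* country and measure together identify each observation;
* isid stays silent when that holds and stops with an error when it does not
isid country measure
```

If isid passes for the variables intended for i(), that particular reshape error should not appear.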
I have a dataset covering a number of companies, with a variable for each firm's number of employees. In some years the number of employees has not been reported, so some years appear blank while the years before and after contain a value.
The data is similar to:
COMPANY YEAR NO. EMPLOYEES
Company 1 2007 4
Company 1 2008 5
Company 1 2009 5
Company 1 2010 5
Company 2 2007 11
Company 2 2008 10
Company 2 2009
Company 2 2010 10
Company 3 2007 3
Company 3 2008 4
Company 3 2009
Company 3 2010 3
I would like to be able to search the dataset for any such occurrences, make an indicator for these years, and afterwards replace each blank spot with the value from the year before. If there is no previous year to use as a replacement, or the previous year is also blank, the value from the year after the blank spot should be used. I am hoping for the dataset to look like:
COMPANY YEAR NO. EMPLOYEES
Company 1 2007 4
Company 1 2008 5
Company 1 2009 5
Company 1 2010 5
Company 2 2007 11
Company 2 2008 10
Company 2 2009 10
Company 2 2010 10
Company 3 2007 3
Company 3 2008 4
Company 3 2009 4
Company 3 2010 3
To sum up, I first need to check whether I have a problem with missing values in between two years (it is important that the code does not replace missing values before or after the last year with a non-missing value, since some firms exit the sample). Next, if there are any blank years between two non-blank years, I would like to replace these blank spots as described above.
The method I would use:
1. Sort the dataset company/year.
2. Replace missing values using LAG function if the missing value is not the first observation of the company group.
3. Reverse the sort order
4. Repeat step 2 on the dataset with reversed order
5. Return the dataset to the original order
Please note, I have changed your original data for Company 3 in order to have a case for your second scenario (missing value, no previous record).
DATA HAVE;
/* column input; SAS columns start at 1, and these ranges match the data lines as shown below */
input COMPANY $ 1-9 YEAR 11-14 N_EMPLOYEES 16-17;
datalines;
Company 1 2007 4
Company 1 2008 5
Company 1 2009 5
Company 1 2010 5
Company 2 2007 11
Company 2 2008 10
Company 2 2009
Company 2 2010 10
Company 3 2007
Company 3 2008 3
Company 3 2009 4
Company 3 2010 3
;
run;
PROC SORT DATA=HAVE
OUT=DOSOMEWORKHERE;
BY COMPANY YEAR;
RUN;
DATA DOSOMEWORKHERE (drop=PREV_N_EMPLOYEES);
set DOSOMEWORKHERE;
by COMPANY;
PREV_N_EMPLOYEES = LAG(N_EMPLOYEES);
if first.COMPANY then
do;
PREV_N_EMPLOYEES = .;
end;
if N_EMPLOYEES = . then N_EMPLOYEES = PREV_N_EMPLOYEES;
run;
PROC SORT DATA=DOSOMEWORKHERE
OUT=DOSOMEWORKHERE;
BY DESCENDING COMPANY DESCENDING YEAR ;
RUN;
DATA DOSOMEWORKHERE (drop=PREV_N_EMPLOYEES);
set DOSOMEWORKHERE;
by DESCENDING COMPANY;
PREV_N_EMPLOYEES = LAG(N_EMPLOYEES);
if first.COMPANY then
do;
PREV_N_EMPLOYEES = .;
end;
if N_EMPLOYEES = . then N_EMPLOYEES = PREV_N_EMPLOYEES;
run;
PROC SORT DATA=DOSOMEWORKHERE
OUT=WANT;
BY COMPANY YEAR;
RUN;
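For comparison, the same five steps can be sketched in Stata. This is only an illustration under assumptions: the variable names company, n_employees, was_missing and negyear are made up here, and the data are a shortened version of the example above. One difference worth knowing: Stata's replace works through the observations in order, so a run of consecutive blanks is filled from the nearest earlier value, whereas LAG in the SAS code returns the value as it was read.

```stata
clear
input str9 company year n_employees
"Company 2" 2007 11
"Company 2" 2008 10
"Company 2" 2009 .
"Company 2" 2010 10
"Company 3" 2007 .
"Company 3" 2008 3
"Company 3" 2009 4
"Company 3" 2010 3
end

* indicator for the observations that start out blank
gen byte was_missing = missing(n_employees)

* steps 1-2: within company, in year order, copy the previous year's value
bysort company (year): replace n_employees = n_employees[_n-1] if missing(n_employees) & _n > 1

* steps 3-4: reverse the year order and repeat, so a leading blank takes the following year's value
gen negyear = -year
bysort company (negyear): replace n_employees = n_employees[_n-1] if missing(n_employees) & _n > 1

* step 5: back to the original order
sort company year
drop negyear
list, sepby(company)
```

Like the SAS code, this also fills a blank in a company's first year; if firms that enter or exit the sample should keep their blanks, the second replace would need an extra condition.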
I have the following panel dataset.
I did
sort FirmID Year
to make the following.
FirmID Year
1 1996
1 1997
1 1998
2 2000
2 2001
I want to create a new variable exitnextyear which is 1 if the firm does not exist next year, so that the output is
FirmID Year exitnextyear
1 1996 0
1 1997 0
1 1998 1
2 2000 0
2 2001 1
I think I have to use something like
by FirmID: gen exitnextyear (and something)
but I don't know what to do next.
clear
input FirmID Year
1 1996
1 1997
1 1998
2 2000
2 2001
end
bysort FirmID (Year) : gen byte exitnextyear = _n == _N
list, sepby(FirmID)
For the principles, see help and manual entries on by: and/or a tutorial review accessible here.
Row is spreadsheetspeak; in Stata the term is observation.
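The line above flags the last observation of each firm, which matches "does not exist next year" as long as no firm has gaps in its years. If gaps are possible, a variant is to compare each year with the next observed year. A sketch, with the example changed so that firm 1 skips 1998 (that change is an assumption for illustration):

```stata
clear
input FirmID Year
1 1996
1 1997
1 1999
2 2000
2 2001
end

* 1 if the firm has no observation for the following calendar year;
* in a firm's last observation Year[_n+1] is missing, so the comparison is true there too
bysort FirmID (Year): gen byte exitnextyear = Year[_n+1] != Year + 1
list, sepby(FirmID)
```

Here 1997 is flagged for firm 1 as well, because the firm is absent in 1998 even though it reappears in 1999.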
I'm looking to build flags for students who have repeated a grade, skipped a grade, or who have an unusual grade progression (e.g. 4th grade in 2008 and 7th grade in 2009). My data is unique at the student id-year-subject level and structured like this (albeit with more variables):
id year subject tested_grade
1 2011 m 10
1 2012 m 11
1 2013 m 12
2 2011 r 4
2 2012 r 7
2 2013 r 8
3 2011 m 6
3 2013 m 8
This is the code that I've used:
sort id year grade
gen repeat_flag = .
replace repeat_flag = 1 if year!=year[_n+1] & grade==grade[_n+1] ///
& subject!=subject[_n+1] & id==id[_n+1]
replace repeat_flag = 0 if repeat_flag==.
One problem is that there are a lot of students who took a test in, say, 6th grade, didn't take one in 7th, and then took one in 8th grade. This varies across years and school districts, as certain school districts adopted tests in different years for different grade levels. My code doesn't account for this.
Regardless, I think there must be more elegant ways to do this. As a side note, I would also like to know whether using indexes is appropriate for a problem like this. Thanks!
Edit
I have included a sample of what my data look like above, in response to one of the comments below. If it is still not clear, any feedback is welcome.
What may seem anomalous are students progressing faster or more slowly in tested grade than the passage of time would imply. That's possibly just one line for the grunt work:
clear
input id year str1 subject tested_grade
1 2011 m 10
1 2012 m 11
1 2013 m 12
2 2011 r 4
2 2012 r 7
2 2013 r 8
3 2011 m 6
3 2013 m 8
end
bysort id (year) : gen flag = (tested - tested[_n-1]) - (year - year[_n-1])
list if flag != 0 & flag < . , sepby(id)
     +---------------------------------------+
     | id   year   subject   tested~e   flag |
     |---------------------------------------|
  5. |  2   2012         r          7      2 |
     +---------------------------------------+
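The flag is the change in tested grade minus the number of years elapsed, so its sign separates the cases the question asks about. A sketch of turning it into two indicators; the names skipped and repeated are made up here, and the code assumes one subject per student as in the sample (with several subjects per year, bysort id subject (year) would be the natural adjustment):

```stata
clear
input id year str1 subject tested_grade
1 2011 m 10
1 2012 m 11
1 2013 m 12
2 2011 r 4
2 2012 r 7
2 2013 r 8
3 2011 m 6
3 2013 m 8
end

bysort id (year): gen flag = (tested_grade - tested_grade[_n-1]) - (year - year[_n-1])

* positive flag: tested grade advanced faster than the calendar (grades skipped)
gen byte skipped  = flag > 0 if !missing(flag)
* negative flag: tested grade advanced more slowly than the calendar (repeated or held back)
gen byte repeated = flag < 0 if !missing(flag)

list, sepby(id)
```

Each student's first observation has a missing flag, so both indicators stay missing there rather than being set to 0. Student 3, who has no 2012 test, is not flagged, because two grades over two years is normal progression.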
I have the following dataset (individual level data):
pid year state income
1 2000 il 100
2 2000 ms 200
3 2000 al 30
4 2000 dc 400
5 2000 ri 205
1 2001 il 120
2 2001 ms 230
3 2001 al 50
4 2001 dc 400
5 2001 ri 235
.........etc.......
I need to estimate average income for each state in each year and create a new dataset that would look like this:
state year average_income
ar 2000 150
ar 2001 200
ar 2002 250
il 2000 150
il 2001 160
il 2002 160
...........etc...............
I already have code that runs perfectly fine (it uses two loops). However, I would like to know whether there is a better way to do this in Stata, something like an SQL-style query.
This is shorter code than any suggested so far:
collapse average_income=income, by(state year)
This shouldn't need two loops, or any loop for that matter. There are more efficient ways to do this. When you are repeating an operation on many groups, the bysort prefix is useful:
bysort year state: egen average_income = mean(income)
You also don't have to build the new dataset from scratch; you can prune this one and save it. Start by keeping only the variables you want (state, year and average_income), then get rid of duplicates:
keep state year average_income
duplicates drop
save "mynewdataset.dta"
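If the individual-level data are still needed afterwards, one option is to wrap the aggregation in preserve and restore, so the averages go to a new file and the original data come back into memory. A sketch with a few made-up rows in the layout from the question; the file name is the one used above, and the replace option is added only so the example can be rerun:

```stata
clear
input pid year str2 state income
1 2000 il 100
2 2000 ms 200
3 2000 al 30
1 2001 il 120
2 2001 ms 230
3 2001 al 50
end

preserve
* one observation per state-year, holding the mean income
collapse (mean) average_income = income, by(state year)
save "mynewdataset.dta", replace
restore

* the individual-level data are back in memory at this point
describe
```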
You have the SQL tag on the question. This is a basic aggregation query in SQL:
select state, year, avg(income) as average_income
from t
group by state, year;
How to put this in a table depends on your database. One of the following typically works:
create table NewTable as
select state, year, avg(income) as average_income
from t
group by state, year;
Or:
select state, year, avg(income) as average_income
into NewTable
from t
group by state, year;