I have a data set that looks like this:
id firm earnings A
1 A 100 0
1 A 200 0
2 B 50 1
2 B 70 1
3 C 900 0
Within each id and firm group (bys id firm), I want to keep only the first observation if A==0 and all the observations if A==1.
I've tried the following code:
if A==0{
bys id firm: keep if _n==1
}
However, this code drops all the _n>1 observations no matter what the A value is.
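The if (condition) { ... } syntax is the if command, which is used for control flow, not for selecting observations. As you have it now, Stata evaluates A==0 only for the first observation; that is true in your data, so the keep statement inside the braces is run on every group regardless of A. To select observations, use the if qualifier and combine conditions with and (&) or or (|). Try this: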
bys id firm: keep if (_n==1 & A==0) | A==1
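To make the combined condition concrete, here is a minimal Python sketch of the same row-by-row logic. It is not Stata code; the helper name keep_rows is made up for illustration, and the sketch assumes the rows are already in the order shown in the question.

```python
# Rows are (id, firm, earnings, A), in the same order as the example data.
rows = [
    (1, "A", 100, 0),
    (1, "A", 200, 0),
    (2, "B", 50, 1),
    (2, "B", 70, 1),
    (3, "C", 900, 0),
]

def keep_rows(rows):
    """Keep all rows with A == 1, but only the first row of each
    (id, firm) group when A == 0."""
    seen = set()   # (id, firm) groups encountered so far
    kept = []
    for id_, firm, earnings, a in rows:
        is_first = (id_, firm) not in seen   # plays the role of _n == 1
        seen.add((id_, firm))
        # Same condition as: (_n==1 & A==0) | A==1
        if (is_first and a == 0) or a == 1:
            kept.append((id_, firm, earnings, a))
    return kept

kept = keep_rows(rows)
for row in kept:
    print(row)
```

The condition is evaluated for every row, which is what the if qualifier does and what the if command does not.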
Related
I have a data set that looks like this:
id A
1 5
1 5
1 .
1 5
5 .
5 .
5 8
13 .
13 .
13 .
13 .
I want to calculate the number of missing A values in panels where at least one A value is not missing, in Stata.
For example, in the example above there are 3 missing values that are not the only missing value in that panel.
There is one missing A value when id is 1 and as there also are non-missing A values when id=1, I want to count that one.
Similarly, there are two missing A values when id is 5 and as there are also non-missing values when id=5, I want to count those two too.
There are 4 missing A values when id=13 but as there are no non-missing values when id=13, I don't want to count these.
I can't follow this, but the number of observations in each panel is
bysort id : gen count = _N
and the number of non-missing values of A is
by id : egen A_nm = count(A)
from which the missing values can be counted by subtraction. Alternatively, missing values can be counted directly by
by id: egen A_m = total(missing(A))
If that doesn't help, you may have to expand your question by showing what the new variable(s) you want looks like.
EDIT What you want may be just an application of this: you want to look at A_m values conditional on A_nm being positive.
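As a cross-check on that reading of the question, the following Python sketch counts missing values only in panels that also have at least one non-missing value. None stands in for Stata's missing value, and the dictionary names merely mirror A_m and A_nm; this is an illustration of the logic, not a Stata translation.

```python
from collections import defaultdict

# (id, A) pairs from the example; None stands in for Stata's missing value.
data = [
    (1, 5), (1, 5), (1, None), (1, 5),
    (5, None), (5, None), (5, 8),
    (13, None), (13, None), (13, None), (13, None),
]

n_missing = defaultdict(int)      # per-panel count of missing A, like A_m
n_nonmissing = defaultdict(int)   # per-panel count of non-missing A, like A_nm
for id_, a in data:
    if a is None:
        n_missing[id_] += 1
    else:
        n_nonmissing[id_] += 1

# Count missing values only in panels with at least one non-missing value.
total = sum(m for id_, m in n_missing.items() if n_nonmissing[id_] > 0)
print(total)  # 3
```

Panel 13 contributes nothing because its non-missing count is zero, which matches the count of 3 described in the question.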
I was confused by the following SAS code. The SAS data set named WORK.SALARY contains 10 observations for each department and is currently ordered by Department. The following SAS program is submitted:
data WORK.TOTAL;
set WORK.SALARY(keep=Department MonthlyWageRate);
by Department;
if First.Department=1 then Payroll=0;
Payroll+(MonthlyWageRate*12);
if Last.Department=1;
run;
So, what exactly is First.Department and Last.Department? Many thanks for your time and attention.
Your data step calculates the total PAYROLL for each DEPARTMENT.
The FIRST. and LAST. variables are generated automatically when you use a BY statement. They are true when the current observation is the first (or last) observation in the BY group. See "How the DATA Step Identifies BY Groups" in the SAS documentation.
The sum statement (Syntax: var+expression;) for PAYROLL means that the value of PAYROLL is retained (or carried over) to the next observation.
The IF/THEN statement initializes the value to zero when a new group starts.
The subsetting IF statement will make sure that only the final observation for each department is output.
As explained, it is calculating payroll for each department.
First.department is set to 1 when the first record for a department is read. Last.department is set to 1 when the last record for that department is read.
So if you have :
Department Wage
1 100
1 200
1 300
2 1000
2 2000
2 3000
With the first. and last. assigned, it will look like this:
Department  Wage  first.department  last.department
1           100   1                 0
1           200   0                 0
1           300   0                 1
2           1000  1                 0
2           2000  0                 0
2           3000  0                 1
Now you can follow your logic as to what happens when first.department = 1.
By the way, the statement if Last.Department=1; in your code has no THEN clause. It is a subsetting IF, so only the last observation of each department is written to WORK.TOTAL.
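The flags and the running total can be traced with a short Python sketch that imitates the DATA step on the six example rows above. It is only an approximation of the SAS behaviour: the function name total_payroll is invented for illustration, and the output keeps just Department and Payroll, whereas the real data set would also carry MonthlyWageRate from the last row.

```python
# (Department, MonthlyWageRate) rows, already ordered by Department.
salary = [(1, 100), (1, 200), (1, 300), (2, 1000), (2, 2000), (2, 3000)]

def total_payroll(rows):
    """Imitate the DATA step: reset at First., accumulate, output at Last."""
    output = []
    payroll = 0   # retained across rows, like a sum-statement variable
    for i, (dept, wage) in enumerate(rows):
        first = i == 0 or rows[i - 1][0] != dept             # First.Department
        last = i == len(rows) - 1 or rows[i + 1][0] != dept  # Last.Department
        if first:
            payroll = 0            # if First.Department=1 then Payroll=0;
        payroll += wage * 12       # Payroll+(MonthlyWageRate*12);
        if last:                   # if Last.Department=1; (subsetting IF)
            output.append((dept, payroll))
    return output

totals = total_payroll(salary)
print(totals)  # [(1, 7200), (2, 72000)]
```

Only one row per department reaches the output, and it holds the annual payroll accumulated over the whole BY group.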
I have a situation where I need to order several dates to see if there is a gap in coverage. My data set looks like this, where id is the panel id and start and end are dates.
id start end
a 01.01.15 02.01.15
a 02.01.15 03.01.15
b 05.01.15 06.01.15
b 07.01.15 08.01.15
b 06.01.15 07.01.15
I need to identify any cases where there is a gap in coverage, meaning that the 2nd start date for an id is later than the first end date for the same id. Note also that the same id can have an undetermined number of observations, and they might not be in any particular order. I wrote the code below for a case where there are only two observations per id.
bys id: gen y=1 if end < start[_n+1]
However, this code does not produce the desired results. I'm thinking that there should be another way to approach this problem.
Your approach seems sound in essence, assuming that your date variables are really Stata daily date variables formatted suitably. You don't explain at all what "does not produce the desired results" means to you.
The code below creates a sandbox similar to your example, but with string variables converted to daily dates.
Key details include:
Observations must be sorted by date within panel.
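For the last observation in each panel, start[_n+1] refers to an observation beyond the panel and is returned as missing, which Stata treats as greater than any known date. The comparison would therefore be true, so the code here uses if _n < _N to return the corresponding indicator as missing instead.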
clear
input str1 id str8 (s_start s_end)
a "01.01.15" "02.01.15"
a "02.01.15" "03.01.15"
b "05.01.15" "06.01.15"
b "07.01.15" "08.01.15"
b "06.01.15" "07.01.15"
b "10.01.15" "12.01.15"
end
foreach v in start end {
gen `v' = daily(s_`v', "DMY", 2050)
format `v' %td
}
// the important line here
bysort id (start) : gen first = end < start[_n+1] if _n < _N
list , sepby(id)
     +----------------------------------------------------------+
     | id    s_start      s_end       start         end   first |
     |----------------------------------------------------------|
  1. |  a   01.01.15   02.01.15   01jan2015   02jan2015       0 |
  2. |  a   02.01.15   03.01.15   02jan2015   03jan2015       . |
     |----------------------------------------------------------|
  3. |  b   05.01.15   06.01.15   05jan2015   06jan2015       0 |
  4. |  b   06.01.15   07.01.15   06jan2015   07jan2015       0 |
  5. |  b   07.01.15   08.01.15   07jan2015   08jan2015       1 |
  6. |  b   10.01.15   12.01.15   10jan2015   12jan2015       . |
     +----------------------------------------------------------+
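The same sort-then-compare idea can be checked outside Stata. The Python sketch below uses the sandbox data, sorts by start date within id, and compares each end date with the next start date; None plays the role of the missing indicator on the last observation of each panel. The names parse and gap_flags are illustrative only.

```python
from datetime import datetime
from itertools import groupby

# (id, start, end) as day.month.year strings, same rows as the sandbox.
spells = [
    ("a", "01.01.15", "02.01.15"),
    ("a", "02.01.15", "03.01.15"),
    ("b", "05.01.15", "06.01.15"),
    ("b", "07.01.15", "08.01.15"),
    ("b", "06.01.15", "07.01.15"),
    ("b", "10.01.15", "12.01.15"),
]

def parse(text):
    """Convert a DD.MM.YY string to a date."""
    return datetime.strptime(text, "%d.%m.%y").date()

# Sort by id, then by start date within id (like bysort id (start)).
parsed = sorted((id_, parse(s), parse(e)) for id_, s, e in spells)

gap_flags = []
for id_, group in groupby(parsed, key=lambda r: r[0]):
    group = list(group)
    for j, (_, start, end) in enumerate(group):
        if j < len(group) - 1:
            flag = int(end < group[j + 1][1])   # end < start[_n+1]
        else:
            flag = None                         # last in panel: no next spell
        gap_flags.append((id_, start.isoformat(), flag))

for row in gap_flags:
    print(row)
```

The only flag equal to 1 belongs to the spell in panel b that starts on 7 January, since the next spell does not start until 10 January.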
I have a table which is like this:
Geo_Key Var1 Var2..Var50
123 1 0 .. 1
524 0 1 .. 1
323 1 1 .. 1
Where Var1-Var50 represents 50 columns having value 1/0.
I want to select the count of distinct Geo_Key values for each column (Var1-Var50) where the column's value is 1.
So Results would be like:
Var1 50
Var2 60
....
...
Var50 10
Since your variables are binary (specifically 0/1), you can also sum each column. The sum gives the number of rows in which that variable equals 1, which equals the count of distinct Geo_Key values as long as each Geo_Key appears on only one row.
Or you can use proc freq. Please check the following link:
http://www2.sas.com/proceedings/sugi25/25/btu/25p069.pdf
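To show when the column sum and the distinct count agree, here is a small Python sketch on made-up data. Only Var1 to Var3 are included because the original table elides the remaining columns, and the fourth row with a repeated Geo_Key is a hypothetical addition to show where the two results would diverge.

```python
# Hypothetical rows: Var1-Var3 only; the last row repeats Geo_Key 323 on purpose.
rows = [
    {"Geo_Key": 123, "Var1": 1, "Var2": 0, "Var3": 1},
    {"Geo_Key": 524, "Var1": 0, "Var2": 1, "Var3": 1},
    {"Geo_Key": 323, "Var1": 1, "Var2": 1, "Var3": 1},
    {"Geo_Key": 323, "Var1": 1, "Var2": 0, "Var3": 1},
]
var_names = ["Var1", "Var2", "Var3"]

# Summing a 0/1 column counts the rows where it equals 1.
column_sums = {v: sum(r[v] for r in rows) for v in var_names}

# Counting distinct keys needs a set, so repeated keys are counted once.
distinct_counts = {
    v: len({r["Geo_Key"] for r in rows if r[v] == 1}) for v in var_names
}

print(column_sums)      # {'Var1': 3, 'Var2': 2, 'Var3': 4}
print(distinct_counts)  # {'Var1': 2, 'Var2': 2, 'Var3': 3}
```

If Geo_Key is unique per row, the two dictionaries are identical and the simple sum is enough.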
I think egen might help me here, but for whatever reason I can't quite figure out the right syntax. I'd like to create a new variable that takes a value of 1 for all observations in a group if, for any of the observations in the group, X is true. So, for example, my data has the obs, group, and flag variables, and I want to generate the variable grpflag.
obs group flag grpflag
1 1 0 1
2 1 1 1
3 1 0 1
4 2 0 0
5 2 0 0
6 2 0 0
7 3 1 1
8 3 0 1
So, in the example data, since flag==1 for one (i.e., any) of the observations in group 1, I want grpflag to take the value 1 for all observations in group 1. The same is true for group 3, and the opposite is true for group 2.
You were right: the egen command can do this.
egen grpflag = max(flag), by(group)
See the Stata FAQ http://www.stata.com/support/faqs/data-management/create-variable-recording/ for more detail on the correspondences any:maximum and all:minimum exploited in Stata.
Note that while your example is easy (flag is already 0 or 1, so max() can be applied directly to flag) the argument of max() can be an expression, so the syntax extends easily to more general cases, e.g. max(foo == 42).
Even if egen were not available, or did not work like this, this kind of one-liner is possible in Stata, and will be more efficient than calling egen:
bysort group (flag) : gen grpflag = flag[_N]
However, that would be thrown off by missing values on flag, which sort after all nonmissing values and so could end up in flag[_N], so you would need to work around that. In turn that could just be
gen isflag = flag == 1
bysort group (isflag) : gen grpflag = isflag[_N]
The general principle is that so long as what you are sorting is just 0 and 1, any values of 1 will be sorted to the end within each block of observations.
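The any:maximum correspondence is easy to verify on the example data. This Python sketch takes the per-group maximum of the indicator flag == 1, which is the same idea as the isflag variant above; it illustrates the principle only and is not how egen is implemented.

```python
from collections import defaultdict

# (group, flag) pairs in observation order, from the example data.
obs = [(1, 0), (1, 1), (1, 0), (2, 0), (2, 0), (2, 0), (3, 1), (3, 0)]

# Per-group maximum of the 0/1 indicator (flag == 1); a missing flag,
# represented here as None, would simply count as 0.
group_max = defaultdict(int)
for group, flag in obs:
    group_max[group] = max(group_max[group], int(flag == 1))

# Spread the group-level result back to every observation.
grpflag = [group_max[group] for group, _ in obs]
print(grpflag)  # [1, 1, 1, 0, 0, 0, 1, 1]
```

Replacing max with min on the same indicator would give the "all observations in the group" version instead.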