Stata drop observations command

Stata drop observations command - stata

I have panel data for two years for individuals (id). A dummy variable (empl) takes on three values (1,2,3). I'd like to keep only those ids which take on a value of 1. What Stata command should I use?

The code cited in the comments
by empl, sort: keep if empl == 1
is equivalent to
keep if empl == 1
and pays no attention to the panel structure.
What is wanted is to keep panels for which empl is always 1: if that is so, the minimum and maximum are always 1, so a criterion is
by id (empl), sort: keep if empl[1] == empl[_N] & empl[1] == 1
or
by id (empl), sort: keep if empl[1] == 1 & empl[_N] == 1

Related

Tableau check for multiple IDs across 2 groups

I want to create a set in tableau, which will show either one of these two values: Y or N
2 already existing columns are here important, "VAT-ID" and "CUSTOMER-ID". the new column should check if a customer-ID has multiple VAT-IDs. If yes, the value "Y" should be displayed, else "N".
The table looks like:
customer-id VAT-id in-both
123456 EE999999999 Y
654321 AA999999999 N
666666 GG999999999 N
123456 KK999999999 Y
654321 AA999999999 N
any Help would be appreciated, I have tried IF [CustomerID] = 1 AND Count([VAT-ID]) > 1 THEN 'Y' ELSE 'N' END which didn't work.

You are close. For this you need an LOD (Level of Detail) expression. LOD expressions allow you to do calculations at a different granularity then the view is rendered.
You can use:
if
{fixed [Customer-Id]: countd([VAT-id]) } > 1
then 'Y'
else 'N'
end
The LOD is the {fixed...}. The way you read this is you want to count the distinct number of VAT-id per each Customer-id. (eg 123456 will return 2; all others will return 1). Then you just wrap that in an If statement.

Looping through every value

I was trying to run a loop through a variable and was unsure how to code up my thoughts. So, I have variable called newid that goes as
newid
1
1
2
2
3
3
and so on.
foreach x in newid2 {
replace switchers = 1 if doc[_n] != doc[_n+1]
}
I want to modify this code so that this code will run for each two values (in this case run for 1 and 1, 2 and 2). What would be the best way to modify this? Please help me

Something like this can be done with levelsof:
clear
input id str1 doc
1 "A"
1 "B"
2 "A"
3 "C"
3 "A"
end
gen switcher1 = 0
levelsof id
foreach i in `r(levels)' {
quietly tab doc if id==`i'
replace switcher1 = 1 if r(r)>1 & id==`i'
}
However, you there are certainly more efficient ways to accomplish your goal. Here's one example that tags ids that switch doctors:
ssc install egenmore
bysort id: egen num_docs = nvals(doc)
generate switcher2 = cond(num_docs>1,1,0)
The underlying idea is the same. You count the number of distinct values of doc for each id. If that number exceeds one, the id is tagged as a switcher. The second version is arguably more efficient since it does not involve looping over each value of id.

How to write loop across Hierarchical Data (household-individual) in stata?

I'm now working on a household survey data set and I'd like to give certain members extra IDs according to their relationship to the household head. More specifically, I need to identify the adult children of household head and his/her spouse, if married, and assign them "sub-household IDs".
The variables are: hhid - household ID; pid -individual ID; relhead - relationship with head.
Regarding relhead, a 1 represents the head, a 6 represents a child, and a 7 represents a child-in-law. Below some example data, including in the last column the desired outcome. I assume that whenever a 6 is followed by a 7, they constitute a couple and belong to the same sub-household.
hhid pid relhead sub_hhid(desired)
50 1 1 1
50 2 3 1
50 3 6 2
50 4 6 3
50 5 7 3
-----------------------------------------------
67 1 1 1
67 3 6 2
67 4 7 2
Here are some thoughts:
There may be married and unmarried adult children within one household, the family structure is a little bit complicated, so I want to write some loop across the members in a household.
The basic idea is in the outer loop we identify the children staying-at-home and then check if there's a spouse presented, if there is, then we give the couple an indicator, if not, we continue and give the single stay_chil other indicator. After walking through all the possible members within a household, we get a series of within-household IDs. To facilitate further analysis , I need some kind of external ID variable to separate the sub-families.
* Define N as the total number of household, n as number of individual household size
* sty_chil is indicator for adult child who living with parents(head)
* sty_chil_sp is adult child's spouse
* "hid" and "ind_id" are local macros
forvalue hid=1/N {
forvalue ind_id= 1/n {
if sty_chil[`ind_id']==1 {
check if sty_chil_sp[`ind_id+1']==1 {
if yes then assign sub_hhid to this couples *a 6-7 pairs,identifid as couple
}
else { * single 6 identifid as single child
assign sub_hhid to this child
}
else { *Other relationships rather than 6, move forward
++ind_id the members within a household
}
++hid *move forward across households
}
The built-in stata by,sort: is pretty powerful but here I want to treat part of family members who fall into certain criterion and leave other untouched, so a if-else type loop is more natural for me (even by: may achieve my goal,it's always too tactful when situation become not so simple，and we cannot exhaust all the possible pattern of household pattern).
An immediate problem is that I don't know how to write loop across house IDs and individual IDs, because I used to acquire the household size (increment of outer loop) using by command (I'm not sure in this case it's 1 or the numerber of family members), and I'm not sure if mix up the by and if loops is a good programming practice, I favor write a "full loop" in this case. Please give me some clues how to achieve my goal and provide (illustrate)pseudo code for me.
An extra question is I cannot find the ado file which contains the content of by command, does it exist?

I will abstract from the issue of whether the assumption used to create matches is a sensible one or not. Rather, let this be an example of reaching the desired results without using explicit loops. Some logic and the use of subscripting (see help subscripting) can get you far.
clear
set more off
*----- example data -----
input ///
hhid pid relhead sub_hhid
50 1 1 1
50 3 6 2
50 4 6 3
50 5 7 3
67 1 1 1
67 3 6 2
67 4 7 2
67 5 6 3
end
list, sepby(hhid)
*----- what you want -----
bysort hhid (pid): gen hhid2 = sum( !(relhead == 7 & relhead[_n-1] == 6) )
list, sepby(hhid)
As you can see, one line of code gets you there. The reasoning is the following:
sum() gives the running sum. The arguments to sum(), being conditions, can either be True or False. The ! denotes the logical not (see help operators).
If it is not the case that the relationship is daughter/son-in-law AND the previous relationship is daughter/son, the condition evaluates to True and takes on the value of 1, increasing the running sum by 1. If it evaluates to False, meaning that the relationship is daughter/son-in-law AND the previous relationship is daughter/son, then it takes on the value of 0 and the running sum will not increase. This gives the result you seek.
You do this using the by: prefix, since you want to check each original household independently, so to speak.
For the the first observation of each original household, the condition always evaluates to True. This is because there exist no "previous" observation (relationship), and Stata considers relhead to be missing (., a very large number) and therefore, not equal to 6. This takes the running sum from 0 to 1 for the first observation of each sub-group, and so on.
Bottom line: learn how to use by: and take advantage of the features offered by Stata. Do not swim against the current; not here.
Edit
Please note that instead of progressively changing your example data set, you should provide a representative example from the beginning. Not doing so can render answers that are initially OK, completely inadequate.
For your modified example, add:
replace hhid2 = 1 if !inlist(relhead,6,7)
That will simply assign anyone not 6 or 7 to the same household as the head. The head is assumed to always have hhid2 == 1. If the head can have hhid2 != 1, then
bysort hhid (relhead): replace hhid2 = hhid2[1] if !inlist(relhead,6,7)
should work.
You can follow with:
bysort hhid (pid): replace hhid2 = hhid2[_n-1] + 1 if hhid2 != hhid2[_n-1] & _n > 1
but because they are IDs, it's not really necessary.
Finally, use:
gen hhid3 = string(hhid) + "_" + string(hhid2)
to create IDs with the form 50_1, 50_2, 50_3, etc.
Like I said before, if your data presents more complications, you should present a relevant example.

How do I calculate the maximum or minimum seen so far in a sequence, and its associated id?

From this Stata FAQ, I know the answer to the first part of my question. But here I'd like to go a step further. Suppose I have the following data (already sorted by a variable not shown):
id v1
A 9
B 8
C 7
B 7
A 5
C 4
A 3
A 2
To calculate the minimum in this sequence, I do
generate minsofar = v1 if _n==1
replace minsofar = min(v1[_n-1], minsofar[_n-1]) if missing(minsofar)
To get
id v1 minsofar
A 9 9
B 8 9
C 7 8
B 7 7
A 5 7
C 4 5
A 3 4
A 2 3
Now I'd like to generate a variable, call it id_min that gives me the ID associated with minsofar, so something like
id v1 minsofar id_min
A 9 9 A
B 8 9 A
C 7 8 B
B 7 7 C
A 5 7 C
C 4 5 A
A 3 4 C
A 2 3 A
Note that C is associated with 7, because 7 is first associated with C in the current sorting. And just to be clear, my ID variable here shows as a string variable just for the sake of readability -- it's actually numeric.
Ideas?
EDIT:
I suppose
gen id_min = id if _n<=2
replace id_min = id[_n-1] if v1[_n-1]<minsofar[_n-1] & missing(id_min)
replace id_min = id_min[_n-1] if missing(id_min)
does the job at least for the data in this example. Don't know if it would work for more complex cases.

This works for your example. It uses the user-written command vlookup, which you can install running findit vlookup and following through the link that appears.
clear
set more off
input ///
str1 id v1
A 9
B 8
C 7
B 7
A 5
C 4
A 3
A 2
end
encode id, gen(id2)
order id2
drop id
list
*----- what you want -----
// your code
generate minsofar = v1 if _n==1
replace minsofar = min(v1[_n-1], minsofar[_n-1]) if missing(minsofar)
// save original sort
gen osort = _n
// group values of v1 but respecting original sort so values of
// id2 don't jump around
sort v1 osort
// set obs after first as missing so id2 is unique within v1
gen v2 = v1
by v1: replace v2 = . if _n > 1
// lookup
vlookup minsofar, gen(idmin) key(v2) value(id2)
// list
sort osort
drop osort v2
list, sep(0)
Your code has generate minsofar = v1 if _n==1 which is better coded as generate minsofar = v1 in 1, because it is more efficient.
Your minsofar variable is just a displaced copy of v1, so if this is always the case, there should be simpler ways of handling your problem. I suspect your problem is easier than you have acknowledged until now, and that has come through your post. Perhaps giving more context, expanded example data, etc. could get you better advice.

This is both easier and a little more challenging than implied so far. Given value (a little more evocative than the OP's v1) and a desire to keep track of minimum so far, that's for example
generate min_so_far = value[1]
replace min_so_far = value if value < min_so_far[_n-1] in 2/L
where the second statement exploits the unsurprising fact that Stata replaces in the current order of observations. [_n-1] is the index of the previous observation and in 2/L implies a loop over all observations from the second to the last.
Note that the OP's version is buggy: by always looking at the previous observation, the code never looks at the very last value and will overlook that if it is a new minimum. It may be that the OP really wants "minimum before now" but that is not what I understand by "minimum so far".
If we have missing values in value they will not enter the comparison in any malign way: missing is always regarded as arbitrarily large by Stata, so missings will be recorded if and only if no non-missings are present so far, which is as it should be.
The identifier of that minimum at first sight yields to the same logic
generate min_so_far = value[1]
gen id_min = id[1]
replace min_so_far = value if value < min_so_far[_n-1] in 2/L
replace id_min = id if value < min_so_far[_n-1] in 2/L
There are at least two twists that might bite. The OP mentions a possibility that the identifier might be missing so that we might have a new minimum but not know its identifier. The code just given will use a missing identifier, but if the desire is to keep separate track of the identifier of the minimum value with known identifiers, different code is needed.
A twist not mentioned to date is that observations with different identifier might all have the same minimum so far. The code above replaces the identifier only the first time a particular minimum is seen; if the desire is to record the identifier of the last occurrence the < in the last code line above should be replaced with <=. If the desire is to keep track of the all the identifiers of the minimum so far, then a string variable is needed to concatenate all the identifiers.
With a structure of panel or longitudinal data the whole thing is done under the aegis of by:.
I can't see a need to resort to user-written extensions here.

Generating rolling z-scores of panel data in Stata

I have an unbalanced panel data set (countries and years). For simplicity let's say I have one variable, x, that I am measuring. The panel data sorted first by country (a 3-digit numeric country-code) and then by year. I would like to write a .do file that generates a new variable, z_x, containing the standardized values of the variable x. The variables should be standardized by subtracting the mean from the preceding (exclusive) m time periods, and then dividing by the standard deviation from those same time periods. If this is not possible, return a missing value.
Currently, the code I am using to accomplish this is the following (edited now for clarity)
xtset weocountrycode year
sort weocountrycode year
local win_len = 5 // Defining rolling window length.
quietly: rolling sd_x=r(sd) mean_x=r(mean), window(`win_len') saving(stats_x, replace): sum x
use stats_x, clear
rename end year
save, replace
use all_data_PROCESSED_FINAL.dta, clear
quietly: merge 1:1 (weocountrycode year) using stats_x
replace sd_x = . if `x'[_n-`win_len'+1] == . | weocountrycode[_n-`win_len'+1] != weocountrycode[_n] // This and next line are for deleting values that rolling calculates when I actually want missing values.
replace mean_`x' = . if `x'[_n-`win_len'+1] == . | weocountrycode[_n-`win_len'+1] != weocountrycode[_n]
gen z_`x' = (`x' - mean_`x'[_n-1])/sd_`x'[_n-1] // calculate z-score
UPDATE:
My struggle with rolling is that when rolling is set up to use a window length 5 rolling mean, it automatically does window length 1,2,3,4 means for the first, second, third and fourth entries (when there are not 5 preceding entries available to average out). In fact, it does this in general - if the first non-missing value is on entry 5, it will do a length 1 rolling average on entry 5, length 2 rolling average on entry 6, ..... and then finally start doing length 5 moving averages on entry 9. My issue is that I do not want this, so I would like to avoid performing these calculations. Until now, I have only been able to figure out how to delete them after they are done, which is both inefficient and bothersome.
I tried adding an if clause to the -rolling- statement:
quietly: rolling sd_x=r(sd) mean_x=r(mean) if x[_n-`win_len'+1] != . & weocountrycode[_n-`win_len'+1] != weocountrycode[_n], window(`win_len') saving(stats_x, replace): sum x
But it did not fix the problem and the output is "weird" in the sense that
1) If `win_len' is equal to, say, 10, there are 15 missing values in the resulting z_x variable, instead of 9.
2) Even though there are "extra" missing values in z_x, the observations still start out as window length 1 means, then window length 2 means, etc. which makes no sense to me.
Which leads me to believe I fundamentally don't understand 1) what -rolling- is doing and 2) how an if clause works in the context of -rolling-.
Does this help?
Thanks!

I'm not sure I understand completely but I'll try to answer based on what I think your problem is, and based on a comment by #NickCox.
You say:
... when rolling is set up to use a window length 5 rolling mean...
if the first non-missing value is
on entry 5, it will do a length 1 rolling average on entry 5, length 2
rolling average on entry 6, ...
This is expected. help rolling states:
The window size refers to calendar periods, not the number of
observations. If there
are missing data (for example, because of weekends), the actual number of observations used by command may be less than
window(#).
It's not actually doing a "length 1 rolling average", but I get to that later.
Below some examples to see what rolling does:
clear all
set more off
*-------------------------- example data -----------------------------
set obs 92
gen dat = _n - 1
format dat %tq
egen seq = fill(1 1 1 1 2 2 2 2)
tsset dat
tempfile main
save "`main'"
list in 1/12, separator(4)
*------------------- Example 1. None missing ------------------------
rolling mean=r(mean), window(4) stepsize(4) clear: summarize seq, detail
list in 1/12, separator(0)
*------- Example 2. All but one value, missing in first window ------
use "`main'", clear
replace seq = . in 1/3
list in 1/8
rolling mean=r(mean), window(4) stepsize(4) clear: summarize seq, detail
list in 1/12, separator(0)
*------------- Example 3. All missing in first window --------------
use "`main'", clear
replace seq = . in 1/4
list in 1/8
rolling mean=r(mean), window(4) stepsize(4) clear: summarize seq, detail
list in 1/12, separator(0)
Note I use the stepsize option to make things much easier to follow. Because the date variable is in quarters, I set windowsize(4) and stepsize(4) so rolling is just computing averages by year. I hope that's easy to see.
Example 1 does as expected. No problem here.
Example 2 on the other hand, should be more interesting for you. We've said that what matters are calendar periods, so the mean is computed for the whole year (four quarters), even though it contains missings. There are three missings and one non-missing. summarize is computing the mean over the whole year, but summarize ignores missings, so it just outputs the mean of non-missings, which in this case is just one value.
Example 3 has missings for all four quarters of the year. Therefore, summarize outputs . (missing).
Your problem, as I understand it, is that when you face a situation like Example 2, you'd like the output to be missing. This is where I think Nick Cox's advice comes in. You could try something like:
rolling mean=r(mean) N=r(N), window(4) stepsize(4) clear: summarize seq, detail
replace mean = . if N != 4
list in 1/12, separator(0)
This says: if the number of non-missings for the window (r(N), also computed by summarize), is not the same as the window size, then replace it with missing.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js