I have data with IDs which may or may not have all values present. I want to delete ONLY the observations with no data in them; if there are observations with even one value, I want to retain them. Eg, if my data set is:
ID val1 val2 val3 val4
1 23 . 24 75
2 . . . .
3 45 45 70 9
I want to drop only ID 2 as it is the only one with no data -- just an ID.
I have tried Statalist and Google but couldn't find anything relevant.
This will also work with strings as long as they are empty:
ds id*, not
egen num_nonmiss = rownonmiss(`r(varlist)'), strok
drop if num_nonmiss == 0
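This gets a list of variables that are not the id and drops any observations that only have the id. Note that Stata variable names are case-sensitive, so with the example data, where the identifier is named ID, the first line would need to read ds ID, not.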
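As a minimal sketch, the same steps can be run on the example data from the question. The variable names come from the question; the input block and the final cleanup line are only there to make the example self-contained.

```stata
* Recreate the example data from the question
clear
input ID val1 val2 val3 val4
1 23 .  24 75
2 .  .  .  .
3 45 45 70 9
end

* List every variable except ID (names are case-sensitive)
ds ID, not

* Count non-missing values per observation; strok also allows string variables
egen num_nonmiss = rownonmiss(`r(varlist)'), strok

* Drop observations that carry nothing but an ID
drop if num_nonmiss == 0
drop num_nonmiss
list
```

Only ID 2 should be dropped here, since it is the only observation with no values at all.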
Brian Albert Monroe is quite correct that anyone using dropmiss (SJ) needs to install it first. As there is interest in varying ways of solving this problem, I will add another.
foreach v of var val* {
qui count if missing(`v')
if r(N) == _N local todrop `todrop' `v'
}
if "`todrop'" != "" drop `todrop'
Although it should be a comment under Brian's answer, I will add a comment here as (a) this format is more suited for showing code and (b) the comment follows from my code above. I agree that unab is a useful command and have often commended it in public. Here, however, it is unnecessary as Brian's loops could easily start something like
foreach v of var * {
UPDATE September 2015: See http://www.statalist.org/forums/forum/general-stata-discussion/general/1308777-missings-now-available-from-ssc-new-program-for-managing-missings for information on missings, considered by the author of both to be an improvement on dropmiss. The syntax to drop observations if and only if all values are missing is missings dropobs.
Just another way to do it which helps you discover how flexible local macros are without installing anything extra to Stata. I rarely see code using locals storing commands or logical conditions, though it is often very useful.
// Loop through all variables to build a useful local
foreach vname of varlist _all {
// We don't want to include ID in our drop condition, so don't execute the remaining code if our loop is currently on ID
if "`vname'" == "ID" continue
// This local stores all the variable names except 'ID' and a logical condition that checks if it is missing
local dropper "`dropper' `vname' ==. &"
}
// Let's see all the observations which have missing data for all variables except for ID
// The '1==1' bit is a condition to deal with the last '&' in the `dropper' local, it is of course true.
list if `dropper' 1==1
// Now let's drop those observations
drop if `dropper' 1==1
// Now check they're all gone
list if `dropper' 1==1
// They are.
Now dropmiss may be convenient once you've downloaded and installed it, but if you are writing a do file to be used by someone else, unless they also have dropmiss installed, your code won't work on their machine.
With this approach, if you remove the lines of comments and the two unnecessary list commands, this is a fairly sparse 5 lines of code which will run with Stata out of the box.
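One caveat is that the condition `vname' ==. only works for numeric variables. A possible variation, sketched here on made-up data (the variable name and its values are hypothetical), builds the condition with missing(), which accepts both numeric and string variables:

```stata
* Hypothetical data with one numeric and one string variable
clear
input ID val1 str5 name
1 23 "a"
2 .  ""
3 .  "c"
end

* Start from an empty local, then append one condition per variable
local dropper
foreach vname of varlist _all {
    if "`vname'" == "ID" continue
    local dropper "`dropper' missing(`vname') &"
}

* The trailing 1==1 absorbs the last '&', as in the code above
drop if `dropper' 1==1
list
```

Here ID 3 is kept because name is not empty, even though val1 is missing.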
The example below reproduces my problem. There is a string variable which takes several values. I want to create a global list and iterate over it in a loop. But it does not work. I've tried several versions without success. Here is the example code:
webuse auto, clear
levelsof make // list of car makes
global MAKE r(levels) // save levels in global list
foreach i in $MAKE { // loop some command over saved list
sum if make == "`$MAKE'" // ERROR 198, invalid 'Concord'
}
Using "`$MAKE'" or $MAKE does not yield desired output.
Any ideas what I am doing wrong?
Normally, for lists to work, they should be saved as in A B C D [...]. In my case, levelsof produces a list of the following kind:
di $MAKE
`"AMC Concord"' `"AMC Pacer"' `"AMC Spirit"' `"Audi 5000"' `"Audi Fox"' `"BMW 320i"' [...]
So clearly not what is needed. But not sure how to get what I need.
Here is a solution. Note that I am using a local instead of a global. The difference is only scope. Only use global if you need to reference the value across do-files. You can remove the display lines below.
*Sysuse reads this data from disk, it comes with all Stata installations
sysuse auto, clear
*Use levelsof, and assign the returned r(levels) using a = to the local
levelsof make
local all_makes = r(levels)
*Loop over the local like this. Note that foreach creates a local, in this
*case called this_make that stores the elements in the local one per iteration
foreach this_make of local all_makes {
display "`this_make'"
sum if make == "`this_make'"
}
If global is what you need, then you simply change it to this:
*Sysuse reads this data from disk, it comes with all Stata installations
sysuse auto, clear
*Use levelsof, and assign the returned r(levels) using a = to the global
levelsof make
global all_makes = r(levels)
*Loop over the global like this. Note that foreach creates a local, in this
*case called this_make that stores the elements in the global one per iteration
foreach this_make of global all_makes {
display "`this_make'"
sum if make == "`this_make'"
}
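As an alternative sketch, levelsof can store the list directly through its local() option, which keeps the compound double quotes around each level intact. Using compound double quotes in the comparison as well guards against values that contain quotation marks. The counter n is only there to show that every level was visited.

```stata
sysuse auto, clear

* Store the distinct values of make in the local all_makes
levelsof make, local(all_makes)

* Loop over the levels; each element arrives without its outer quotes
local n = 0
foreach this_make of local all_makes {
    quietly count if make == `"`this_make'"'
    local n = `n' + r(N)
}

* Every observation should have been matched exactly once
display `n'
```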
There is a fine accepted answer but plenty more can be said. See for example this FAQ.
I am positive about levelsof as its original author, but for the purpose specified, to loop over the levels of a variable, it can be a lot cleaner to use egen, group() and loop over the integer levels of that variable. See the FAQ just linked for more. The example in the original question is a case in point, as looping over distinct string values can be tricky with a need to use double quotes " " and to watch out for spaces and so forth.
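A minimal sketch of the egen, group() approach on the auto data follows. The variable name make_id and the choice of summarizing price are assumptions made for illustration; the point is that the loop runs over integers, so no string quoting is needed in the if condition.

```stata
sysuse auto, clear

* Map each distinct value of make to an integer 1, 2, ...; keep the text as value labels
egen make_id = group(make), label

* The largest group code is the number of distinct levels
quietly summarize make_id, meanonly
local ngroups = r(max)

forvalues g = 1/`ngroups' {
    * Look up the original string for display only
    local lbl : label (make_id) `g'
    quietly summarize price if make_id == `g', meanonly
    display "`lbl': mean price = " r(mean)
}
```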
The underlying problem is not revealed but an extra comment is to underline that by: and its sibling commands such as statsby or commands similar in spirit such as rangestat from SSC offer, in effect, looping without looping.
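To illustrate looping without looping, here is a small sketch using by: and statsby on the auto data. Grouping by foreign and summarizing price are choices made for this example only, and the result names mean_price and n_cars are hypothetical.

```stata
sysuse auto, clear

* Run summarize once per group, with no explicit loop
bysort foreign: summarize price

* Collect one result row per group into a new dataset
statsby mean_price=r(mean) n_cars=r(N), by(foreign) clear: summarize price
list
```

After statsby with the clear option, the data in memory hold one observation per level of foreign.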
Using this excellent advice from Statalist, I am running a loop to read in a 60GB Stata dataset and save it in chunks (after some data preprocessing).
Unfortunately, I do not know the total number of observations and so the use command does not execute when asking to read in more data than is available:
use `usevars' in 210000001/220000000 using "a_large_dta_file.dta", clear
The dataset appears to contain fewer than 220000000 observations, but I do not know how many. I am looking for an endoffile operator or something in that spirit to circumvent this problem. Manually I verified that at least 210001001 exist, but this will not help much.
Consider the following reproducible example using Stata's auto toy dataset:
sysuse auto, clear
display _N
74
Using the describe command will get you what you want:
findfile auto.dta
describe using "`r(fn)'" // or ask for only one variable e.g. describe rep78
display r(N)
74
Stata datasets are always rectangular so you can also manually load a single variable and count:
findfile auto.dta // describe has replaced the earlier r() results, so locate the file again
use rep78 using "`r(fn)'", clear // load a variable which also contains missing data
display _N
74
Alternatively, use a loop to load smaller chunks and the capture command to see where it fails.
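Putting these pieces together, one possible sketch reads the observation count from describe using and then caps the last chunk with min(), so that use never asks for more observations than exist. It uses the auto dataset and a chunk size of 30 purely for illustration; the local names are hypothetical, and the preprocessing step is left as a comment.

```stata
* Locate the file and read the number of observations without loading it
findfile auto.dta
local bigfile "`r(fn)'"
quietly describe using "`bigfile'"
local total = r(N)

* Chunk size chosen only for this small example
local chunk = 30
local total_read = 0

forvalues first = 1(`chunk')`total' {
    * Cap the upper bound at the true number of observations
    local last = min(`first' + `chunk' - 1, `total')
    use in `first'/`last' using "`bigfile'", clear
    * ... preprocessing and saving of this chunk would go here ...
    local total_read = `total_read' + _N
}

display `total_read'
```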
I have two lines of data,
Order
17/01/2016
01/02/2014
Basically I want to run a logic like so;
data A.test_active;
set A.Weekly_Email_files_cleaned4;
length active :8.;
length inactive :8.;
if first.Order between '01Jan2014'd and '31Dec2015'd then active= 1;
if last.order between '01Jan2014'd and '31Dec2015'd then inactive= 1;
run;
the field "Order" is formatted by DDMMYY10 when I checked the file properties, but I keep getting this error
ERROR 388-185: Expecting an arithmetic operator.
Can anyone help or suggest something different in the same vein?
In SAS, between is only valid in SQL contexts: either actual PROC SQL, or WHERE statements, generally. It is not otherwise valid in SAS. You would use in (firstval:lastval) instead, if those values are integers (dates are). If they're not integers, you need to use if firstval le val le lastval or similar (can also use ge/lt/gt/>/< or whatever you like, depending on the ordering of things).
Second, first.order and last.order are boolean values - 1 or 0, nothing else, that indicate that you are on a row that is the first row for a new value when sorted by that variable, or the last row similarly. You also must have a by statement by that variable if you're going to use them.
Third, your length statements are wrong; you're confusing some three different things here, I think. Length statements for numerics aren't needed if you're using default length 8, and if you do like having them anyway, you need:
length active 8;
No : or ., both are used for different purposes.
ID first_order Order
alex 01/01/2013 23/01/2015
alex 01/01/2013 23/01/2015
alex 01/01/2013 03/04/2013
Basically, if an order exists after the first order and falls within a certain timeframe (within a year of the date of the first order), then the user is "active".
any ideas much appreciated
thanks
I have a master data file that contains responses from English, German, and French respondents. The open-ended responses (OER) were sent to translators and they sent us back a file with the original OER and English translation of those. Now I want to replace the "empty" columns reserved for English translation in my master data with the new information.
My approach was:
Create a loop in the translation file:
foreach var of varlist *englishtranslation* {
rename `var' new_`var'
}
Then merge new_`var' into master data using respondent ID.
Replace non-missing cases in blank cols using info in new_`var'.
Drop new_`var'.
However, Stata keeps saying that the new variable names new_`var' are invalid:
You attempted to rename q12_v1_995_oe_englishtranslation to
new_q12_v1_995_oe_englishtranslation. That is an invalid Stata
variable name.
Do you have any recommendation on fixing that error or on another approach?
Many thanks,
EL
Edit: I understand that the variable name length limit is 32 and that variable has exactly 32 characters, hence the error when I tried to rename it. But I need to come up with a systematic way to name these variables because multiple people work on it and I don't want to mess with the agreed organization of the dataset.
Your new name has 36 characters. There's a limit of 32 (with Stata 12 and 13, at least).
An example reproducing your error:
clear
set more off
set obs 1
gen q12_v1_995_oe_englishtranslation = 99
gen new_q12_v1_995_oe_englishtranslation = 10
Solution: make the name shorter.
See help varname for details.
Edit
On your question about renaming:
Try:
rename *englishtranslation *engtrans
See help rename and help rename group for details.
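As for another approach to the overall task, a sketch that avoids the new_ prefix altogether is merge with the update option, which fills missing values (including empty strings) in the master data from same-named variables in the using data. This assumes the translation file can use the same variable names as the master file; the variable names and values below are made up for illustration.

```stata
* Hypothetical translation file: respondent ID plus the translated text
clear
input id str20 q12_oe_engtrans
1 "good service"
2 "too expensive"
end
tempfile translations
save `translations'

* Hypothetical master file with an empty column reserved for the translation
clear
input id str20 q12_oe str20 q12_oe_engtrans
1 "guter Service" ""
2 "trop cher"     ""
3 "fine"          ""
end

* update replaces only missing master values with values from the using file
merge 1:1 id using `translations', update
tabulate _merge
list
```

Observations coded _merge == 4 are those where a missing value in the master data was filled in; respondent 3, who needed no translation, stays as is.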
I am trying to execute the following code:
forval i = 1/51 {
// number of households
by hhid, sort: gen nvals = _n==1
count if (nvals & stateID == `i')
local stateTotalHH = r(N)
local avPersonHH`i' = sum(numper)/`stateTotalHH' if(nvals & stateID ==`i')
drop nvals
}
Everything works fine except if is not allowed with sum(). How can I estimate the total or the sum of all values in numper variable for each state and at household level?
ps:
I cannot use collapse numper, by(stateID) because I have other estimations
also, I cannot do the following: duplicates drop hhid, force
Your problem does not even call for sum() with if, so it is best to start at the beginning.
Reconstructing your problem, which is not well explained,
You have observations for individuals within households (identifier hhid) within 50 states of the USA and the District of Columbia (identifier stateID).
You have a variable numper, the number of persons per household, and you want the average per state.
Observations are repeated for each individual in a household, so it is necessary to use just one observation per household.
You can tag each household once by
egen tag = tag(hhid)
The average as a new variable would be
egen avPersonHH = mean(numper/tag), by(stateID)
Stata is going to average numper/tag which variously will be numper/1 and numper/0; the missings from the latter division will just be ignored, which is what is wanted.
That variable is repeated for each household. To see just one value for each stateID,
tabdisp stateID, cell(avPersonHH)
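A small worked sketch on made-up data may help check the logic. The variable names follow the question, while the values are invented: state 1 has two households of sizes 2 and 4, so its average should be 3, and state 2 has a single one-person household.

```stata
* Invented data: one observation per individual
clear
input hhid stateID numper
1 1 2
1 1 2
2 1 4
2 1 4
2 1 4
2 1 4
3 2 1
end

* tag is 1 for exactly one observation per household and 0 otherwise
egen tag = tag(hhid)

* numper/0 is missing and ignored, so each household counts once
egen avPersonHH = mean(numper/tag), by(stateID)

tabdisp stateID, cell(avPersonHH)
```

Without the division by tag, state 1 would average over individuals instead and give 20/6 rather than 3.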
What is wrong with your code? Here is a partial list:
a. No loop is required.
b. If it were, the statement by hhid, sort: gen nvals = _n==1 should not be repeated.
c. sum() is a function for cumulative sums across observations, not what you want here.
d. The line
local avPersonHH`i' = sum(numper)/`stateTotalHH' if(nvals & stateID ==`i')
would at best calculate one number, but the if condition is misplaced. if whatever local ... often makes sense in Stata, but putting if on the right of a local definition only makes sense for manipulating text containing commands.
Your comment on this line misses these basic misconceptions, c. and d.
e. You were aiming to have collected 51 values of averages in as many local macros, but still need to put them somewhere useful.
f. Separate calculation of totals and numbers is not required, as you can get Stata to calculate the mean for you.
(LATER) This code plays along step by step with your aversion to using collapse and duplicates, the grounds for which are not stated. But most experienced Stata users would be happy to use brute force:
duplicates drop hhid, force
collapse numper, by(stateID)
and then merge back. That solution is not only direct, but also uses fewer idiosyncratic Stata details, which can take time to figure out.
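One way this brute-force route might look in full is sketched below, using preserve and a temporary file so that the original individual-level data survive. The data are invented and the names state_means and avPersonHH are hypothetical.

```stata
* Invented data: one observation per individual
clear
input hhid stateID numper
1 1 2
1 1 2
2 1 4
2 1 4
2 1 4
2 1 4
3 2 1
end

* Work on a copy: reduce to one row per household, then one row per state
preserve
duplicates drop hhid, force
collapse (mean) avPersonHH = numper, by(stateID)
tempfile state_means
save `state_means'
restore

* Merge the state averages back onto the individual-level data
merge m:1 stateID using `state_means', nogenerate
list
```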