Replace cases in one dataset using cases from another file

I have a master data file that contains responses from English, German, and French respondents. The open-ended responses (OER) were sent to translators, who returned a file containing the original OER and their English translations. Now I want to fill the "empty" columns reserved for the English translations in my master data with the new information.
My approach was:
Create a loop in the translation file:
foreach var of varlist *englishtranslation* {
rename `var' new_`var'
}
Then merge new_`var' into master data using respondent ID.
Replace non-missing cases in blank cols using info in new_`var'.
Drop new_`var'.
However, Stata keeps saying that the new variable names new_`var' are invalid:
You attempted to rename q12_v1_995_oe_englishtranslation to
new_q12_v1_995_oe_englishtranslation. That is an invalid Stata
variable name.
Do you have any recommendation on fixing that error or on another approach?
Many thanks,
EL
Edit: I understand that the variable name length limit is 32 and that variable has exactly 32 characters, hence the error when I tried to rename it. But I need to come up with a systematic way to name these variables because multiple people work on it and I don't want to mess with the agreed organization of the dataset.

Your new name has 36 characters. There's a limit of 32 (with Stata 12 and 13, at least).
An example reproducing your error:
clear
set more off
set obs 1
gen q12_v1_995_oe_englishtranslation = 99
gen new_q12_v1_995_oe_englishtranslation = 10
Solution: make the name shorter.
See help varname for details.
Edit
On your question about renaming:
Try:
rename *englishtranslation *engtrans
See help rename and help rename group for details.
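As a sketch of another approach to the original question: if the translation file keeps the same variable names as the master, merge with the update option fills in values that are missing in the master from the using file, so no new_ prefix (and no longer name) is needed. The toy data, the respondent ID variable id, and the variable names q1_oe and q1_oe_englishtranslation below are hypothetical stand-ins, not the asker's actual files.

```stata
* Hypothetical master data: translation column still empty for ids 1 and 2
clear
input id str10 q1_oe str10 q1_oe_englishtranslation
1 "hallo" ""
2 "bonjour" ""
3 "hello" "hello"
end
tempfile master
save `master'

* Hypothetical file returned by the translators, same variable names
clear
input id str10 q1_oe_englishtranslation
1 "hello"
2 "hello"
end
tempfile trans
save `trans'

* update: values missing in the master are filled in from the using file
use `master', clear
merge 1:1 id using `trans', update

* _merge == 4 marks observations whose missing values were updated
tabulate _merge
list
```

Because nothing is renamed, the agreed variable names in the master dataset stay untouched.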

Related

Is there a way to keep observations from a dataset conditional on them being in some list?

I have one dta file that contains millions of observations, with about 4 variables. I only want to look at a subset of this data, for which the variable username is contained in a list of a few hundred usernames. I have two .dta files. One has the full set of data and the other has the "roster" which contains the usernames I want to look specifically at.
Looking through the Stata documentation, it seems I want to use keep if exp, but I do not know what expression to use. I cannot even load the roster into Stata without clearing the main dataset from my workspace. How do I reference this separate .dta file without clearing the main dataset?

The FAQ here is aimed at precisely this problem. Use merge to combine the datasets and keep the intersection, defined by _merge being 3.
In principle you could type out one or more commands defining a keep condition, but that is a poor solution because:
It is tedious and error-prone.
inlist() with string arguments is particularly fiddly, if that is part of the solution. (There could be much neater solutions if, say, what to keep can be expressed concisely.)
It is a waste of time and effort as you already have the inclusion information to hand.
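A minimal sketch of the merge approach described above, using made-up data. The roster stays on disk and is only named after using, so the main dataset never has to be cleared from memory. The keep(match) option does the same job as keep if _merge == 3, and nogenerate suppresses the _merge variable. The variable value and the usernames are hypothetical.

```stata
* Hypothetical roster of usernames to keep, saved to disk
clear
input str6 username
"user_a"
"user_b"
end
tempfile roster
save `roster'

* Hypothetical main dataset with repeated usernames
clear
input str6 username value
"user_a" 1
"user_a" 2
"user_b" 3
"user_c" 4
end

* Keep only observations whose username appears in the roster
merge m:1 username using `roster', keep(match) nogenerate
list
```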

The easiest way is keep if inlist(username, "user1", "user2", ...). The problem is that inlist() accepts at most 10 arguments when they are strings, counting username itself, so only 9 values can be compared in one call. If you have more, you have to merge or use regular expressions.
Suppose we have this dataset, saved as all_users.dta:
input str6 username
"user_a"
"user_b"
"user_c"
"user_d"
"user_e"
"user_f"
"user_g"
"user_h"
"user_i"
"user_j"
"user_k"
"user_l"
"user_m"
"user_n"
"user_o"
"user_p"
"user_q"
"user_r"
"user_s"
"user_t"
end
And we have a second dataset, saved as usernames.dta:
input str6 username
"user_a"
"user_b"
"user_c"
"user_d"
"user_e"
"user_f"
"user_g"
"user_h"
"user_i"
"user_j"
"user_k"
"user_l"
"user_m"
"user_n"
"user_o"
end
Then these would be two ways to keep only the observations of all_users.dta where username is in usernames.dta:
*** MERGE ***
clear
use all_users
merge m:1 username using usernames
keep if _merge == 3
*** REGEX ***
clear
use usernames
levelsof username, local(usernames)
use all_users, clear
// Create regular expression
foreach username of local usernames {
local regex `regex'|`username'
}
local regex `=substr("`regex'", 2, .)'
keep if regexm(username, "^(`regex')$")

How can I resolve INDEX MATCH errors caused by discrepancies in the spelling of names across multiple data sources?

I've set up a Google Sheets workbook that synthesizes data from a few different sources via manual input, IMPORTHTML and IMPORTRANGE. Once the data is populated, I'm using INDEX MATCH to filter and compare the information and to RANK each data set.
Since I have multiple data inputs, I'm running into a persistent issue of names not being written exactly the same between sources, even though they're the same person. First names are the primary culprit (i.e. Mary Lou vs Marylou vs Mary-Lou vs Mary Louise) but some last names with special symbols (umlauts, accents, tildes) are also causing errors. When Sheets can't recognize a match, the INDEX MATCH and RANK functions both break down.
I'm wondering how to better unify the data automatically so my Sheet understands that each occurrence is actually the same person (or "value").
Since you can't edit the results of an IMPORTHTML directly, I've set up "helper columns" and used functions like TRIM and SPLIT to try and fix instances as I go, but it seems like there must be a simpler path.
It feels like IFS could work but I can't figure how to integrate it. Also thinking this may require a script, which I'm just beginning to study.
Here's a simplified example of what I'm trying to achieve and the corresponding errors: Sample Spreadsheet
The first tab is attempting to pull and RANK data from tabs 2 and 3. Sample formulas from the Summary tab, row 3 (Amelia Rose):
Cell B3: =INDEX('Q1 Sales'!B:B, MATCH(A3,'Q1 Sales'!A:A,0))
Cell C3: =RANK(B3,$B$2:B,1)
Cell D3: =INDEX('Q2 Sales'!B:B, MATCH(A3,'Q2 Sales'!A:A,0))
Cell E3: =RANK(D3,$D$2:D,1)
I'd be grateful for any insight on how to best index 'Q2 Sales'!B3 as the correct value for 'Summary'!D3. Thanks in advance - the thoughtful answers on Stack Overflow have gotten me this far!

To cover the variants in the example (case, spaces, and hyphens), do it like this:
=ARRAYFORMULA(IFERROR(VLOOKUP(LOWER(REGEXREPLACE(A2:A, "-|\s", )),
{REGEXEXTRACT(LOWER(REGEXREPLACE('Q2 Sales'!A2:A, "-|\s", )),
TEXTJOIN("|", 1, LOWER(REGEXREPLACE(A2:A, "-|\s", )))), 'Q2 Sales'!B2:B}, 2, 0)))

Stata : how to use variables as file name

I would like to use a variable (its value) as a file name. Any ideas? I'm using Stata 14.
Thanks a lot in advance!

Per the comment from #toonice, please do give more details and I can better address your question.
However, you can use local macros to input into file names. Let's say you have a data set of a single variable x taking values of your filenames. You could loop through the data to save different files with the values of x. For example:
local N = _N
forvalues i = 1/`N' {
    // The = evaluates x[`i'], so the macro holds the value in row i rather than the literal text
    local myfilename = x[`i']
    // Insert code that changes data in some way to make files different
    save "../output/`myfilename'_staticfilename.dta", replace
}
Give me more context and I am happy to provide more help.
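A related sketch, assuming the goal is one file per distinct value of a variable: levelsof collects the values into a local macro, and each value is then spliced into the file name. The example uses the auto dataset shipped with Stata; the file names auto_foreign_0.dta and auto_foreign_1.dta are made up and are written to the current working directory.

```stata
sysuse auto, clear

* Collect the distinct values of foreign (0 and 1) in a local macro
levelsof foreign, local(groups)

foreach g of local groups {
    preserve
    * Keep one group and use its value in the file name
    keep if foreign == `g'
    save "auto_foreign_`g'.dta", replace
    restore
}
```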

How do I delete observations with no data in Stata?

I have data with IDs which may or may not have all values present. I want to delete ONLY the observations with no data in them; if there are observations with even one value, I want to retain them. Eg, if my data set is:
ID val1 val2 val3 val4
1 23 . 24 75
2 . . . .
3 45 45 70 9
I want to drop only ID 2 as it is the only one with no data -- just an ID.
I have tried Statalist and Google but couldn't find anything relevant.

This will also work with string variables, since empty strings count as missing:
ds ID, not
egen num_nonmiss = rownonmiss(`r(varlist)'), strok
drop if num_nonmiss == 0
This gets a list of variables that are not the id and drops any observations that only have the id.
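Applied to the example data from the question, the approach can be checked as follows. This is a sketch that assumes the identifier is named ID exactly as in the question (Stata variable names are case-sensitive).

```stata
* Example data from the question
clear
input ID val1 val2 val3 val4
1 23 . 24 75
2 . . . .
3 45 45 70 9
end

* All variables except ID
ds ID, not

* Count nonmissing values per observation; strok also allows string variables
egen num_nonmiss = rownonmiss(`r(varlist)'), strok

* Drop observations that have nothing but an ID
drop if num_nonmiss == 0
drop num_nonmiss
list
```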

Brian Albert Monroe is quite correct that anyone using dropmiss (SJ) needs to install it first. As there is interest in varying ways of solving this problem, I will add another.
foreach v of var val* {
qui count if missing(`v')
if r(N) == _N local todrop `todrop' `v'
}
if "`todrop'" != "" drop `todrop'
Although it should be a comment under Brian's answer, I will add a comment here as (a) this format is better suited to showing code and (b) the comment follows from my code above. I agree that unab is a useful command and have often commended it in public. Here, however, it is unnecessary, as Brian's loops could easily start something like
foreach v of var * {
UPDATE September 2015: See http://www.statalist.org/forums/forum/general-stata-discussion/general/1308777-missings-now-available-from-ssc-new-program-for-managing-missings for information on missings, considered by the author of both to be an improvement on dropmiss. The syntax to drop observations if and only if all values are missing is missings dropobs.

Just another way to do it which helps you discover how flexible local macros are without installing anything extra to Stata. I rarely see code using locals storing commands or logical conditions, though it is often very useful.
// Loop through all variables to build a useful local
foreach vname of varlist _all {
// We don't want to include ID in our drop condition, so don't execute the remaining code if our loop is currently on ID
if "`vname'" == "ID" continue
// This local stores, for every variable except 'ID', a logical condition that checks if it is missing
// missing() also covers string variables and extended missing values such as .a
local dropper "`dropper' missing(`vname') &"
}
// Let's see all the observations which have missing data for all variables except for ID
// The '1==1' bit is a condition to deal with the last '&' in the `dropper' local, it is of course true.
list if `dropper' 1==1
// Now let's drop those observations
drop if `dropper' 1==1
// Now check they're all gone
list if `dropper' 1==1
// They are.
Now dropmiss may be convenient once you've downloaded and installed it, but if you are writing a do file to be used by someone else, unless they also have dropmiss installed, your code won't work on their machine.
With this approach, if you remove the lines of comments and the two unnecessary list commands, this is a fairly sparse 5 lines of code which will run with Stata out of the box.

How to post Stata program via Dropbox or private website?

Here is a sample program .do file, sampleprog.do:
program sampleprog
egen newVar = group (`1' `2')
end
How can I post it on my website (or dropbox), so that other people could install it to their Stata like this?
net from http://www.mywebsite.com/sampleprog.do
*** or maybe like this:
ssc install ...
I read the documentation about stata.toc...but I did not quite get it. What files should I upload and should it be one folder or what?
(PS: I definitely can simply email the .do file but this is not an option in my case.)

Here is a full explanation of how to share program or data files with others using your own website. I tried using Dropbox, but Stata 12 appears to have issues with https, which is the protocol for all Dropbox public links. If you want to use Dropbox, I recommend creating a shared folder that will sync on your collaborators' machines. The rest of this answer assumes you have a website serving pages over http or are using Stata 13, which supports https.
If this is a one-time thing, you can skip the rest of this answer by putting the file on your website and telling your collaborator to type:
. copy http://your-site.com/ado/program.ado program.ado
That will copy the ado file at the specified URL into the user's current directory. If you want to provide information about your files, plan on sharing with multiple people, and need to maintain and document a set of files, read on!
Step 1 Create a folder on your website to hold the programs. I will call mine ado/
Step 2 Add the program files, help files, and data files you want to share. For this example, I have created a simple ado file called unique.ado with the following contents:
********************************************** unique.ado
capture program drop unique
program define unique
*! Count and number observations within group defined by varlist
* Example: unique person_id, obs(prow) tobs(pcount) sortby(time)
* to count and number rows by a variable called person_id
syntax varlist, obs(name) tobs(name) [sortby(varlist)]
bys `varlist' (`sortby') : gen long `obs' = _n
bys `varlist' (`sortby') : gen long `tobs' = _N
la var `obs' "Number of this row within `varlist' group."
la var `tobs' "Total number of rows with identical `varlist' values."
end
Step 3 Create a file called stata.toc to describe the files you wish to share. Here is mine:
********************************************** stata.toc
v 3
d Program to count observations by group
p unique [The unique.ado program for counting observations by group]
These files can be complicated. There are many features I won't cover here, but you can read this documentation to learn more.
Step 4 Create a package file for each of the packages defined by the lines in stata.toc that start with the letter p. Here is my package file for the unique package defined above:
********************************************** unique.pkg
v 3
d unique
d Program to count observations by group
d Distribution-Date: 28 June 2012
f unique.ado
Your directory now looks like this:
ado/
    stata.toc
    unique.ado
    unique.pkg
Step 5 Use the site! Here are the commands to enter.
. net from http://example.com/ado/
. net describe unique
. net install unique
Here is what you'll see after entering the first command:
-----------------------------------------------------------------------------------
http://www.example.com/ado/
Program to count observations by group
-----------------------------------------------------------------------------------
PACKAGES you could -net describe-:
unique [The unique.ado program for counting observations by group]
-----------------------------------------------------------------------------------
The second command, net describe unique, will tell you more about the package:
---------------------------------------------------------------------------------------
package unique from http://www.example.com/ado
---------------------------------------------------------------------------------------
TITLE
unique
DESCRIPTION/AUTHOR(S)
Program to count observations by group
Distribution-Date: 28 June 2012
INSTALLATION FILES (type net install unique)
unique.ado
---------------------------------------------------------------------------------------
The third command, net install unique, will install the package:
checking unique consistency and verifying not already installed...
installing into /Users/cpoliquin/Library/Application Support/Stata/ado/plus/...
installation complete.
EDIT
See Nick's comments in the answer below. I intended this example to be simple and I don't expect other people to use this program. If you plan on submitting things to Stata Journal or SSC then his comments certainly apply! I hope this answer can serve as a decent tutorial for those confused by the official documentation.

This will be too long for a comment, so it is going to be an extra answer.
Your example uses the program name unique. If you search unique, all (or in Stata 13, search unique) you will find that a user-written program with the same name has been installed on SSC since 1998. This will create a clash of names for your users if (and only if) they attempt to use your program and also that earlier program. The more general advice is to search to see if a program name is already in use to try to avoid these problems.
Specifically, although you may just be using your unique as an arbitrary example, note that it contains bugs. An int doesn't contain enough bits to hold observation numbers exactly for large datasets. Also, as a matter of style, unique can change the sort order of your data, which is widely considered to be poor data management style.
Your example concerns dissemination of a program file without an accompanying help file. Suffice it to say that the SSC site would never accept such a program and the Stata Journal would not even review a paper based on such a submission before a help file was written to accompany it. Including explanatory comments with the code may be sufficient for your personal practices, but it falls below general Stata standards.
Stata 13 now supports https. See http://www.stata.com/manuals13/u.pdf, Section 3.6.
In short, I appreciate that you are trying to explain how to do something, but it is already well documented, and explicitly and implicitly some of your recommendations are below community standards.
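To illustrate two of the points above (the name clash and the changed sort order), here is a minimal sketch of how such a program could be written. It is not the original author's code: the name countbygroup is hypothetical, chosen only to avoid clashing with the existing unique on SSC, and the sortpreserve option asks Stata to restore the caller's sort order when the program exits.

```stata
capture program drop countbygroup
program define countbygroup, sortpreserve
    version 12
    syntax varlist, obs(name) tobs(name) [sortby(varlist)]
    * long holds observation numbers exactly, even in large datasets
    bysort `varlist' (`sortby'): gen long `obs' = _n
    bysort `varlist' (`sortby'): gen long `tobs' = _N
end

* Usage on the auto dataset shipped with Stata
sysuse auto, clear
sort price
countbygroup foreign, obs(row) tobs(total) sortby(mpg)

* The data are still sorted by price after the call
list price foreign mpg row total in 1/5
```

A help file would still be needed before distributing anything like this more widely, as noted above.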
