Stata: Efficient way to replace numerical values with string values - stata

I have code that currently looks like this:
replace fname = "JACK" if id==103
replace lname = "MARTIN" if id==103
replace fname = "MICHAEL" if id==104
replace lname = "JOHNSON" if id==104
And it goes on for multiple pages like this, replacing an ID name with a first and last name string. I was wondering if there is a more efficient way to do this en masse, perhaps by using the recode command?

I will echo the other answers that suggest a merge is the best way to do this.
But if you absolutely must code the lines item-wise (again, messy) you can generate a long list ("pages") of replace commands by using MS Excel to "help" you write the code. Here is a picture of your Excel sheet with one example, showing the MS Excel formula:
columns:
A B C D
row: 1 last first id code
2 MARTIN JACK 103 ="replace fname=^"&B2&"^ if id=="&C2
You type that in, make sure it looks like Stata code when the formula calculates (aside from the carets), and copy the formula in column D down to the end of your list. Then copy the whole block of Stata code in column D generated by the formulas into your do-file, and do a find and replace (be careful here if you are using the caret elsewhere for mathematical uses!!) for all ^ to be replaced with ", which will end up generating proper Stata syntax.
(This is truly a brute force way of doing this, and is less dynamic in the case that there are subsequent changes to your generation list. All--apologies in advance for answering a question here advocating use of Excel :) )

You don't explain where the strings you want to add come from, but what is generally the best technique is explained at
http://www.stata.com/support/faqs/data-management/group-characteristics-for-subsets/index.html

Create an associative array of ids vs Fname,Lname
103 => JACK,MARTIN
104 => MICHAEL,JOHNSON
...
Replace
id => hash{id} ( fname & lname )
The efficiency of doing this will be taken care by the programming language used

Related

Googlesheet IF with multiple cases

in my Google sheet table I have the first list with summary of invoices which are then separated to 4 lists according to parameters (manually). I need to know about all invoices from the first list, on which category/list they are.
So for example - lists: Alphabet, abc, def, mno, xyz. In Alphabet is column "list".
How to write function which found invoice on another list according to ID (column B) from Alphabet and write name of the correct list to column "list". I tried to write this function using IF, match, etc. But I still don't have solution. Can you help me please? Sorry for my English :-)
So here is an example which you could adapt. In columns E:H on the first sheet (and I could hide these columns later, starting in row2 and dragging down as needed, I put the following formulas:
=IF(LEN(iferror(query(abc!$A$2:$A,"select A where A='" & $A2 &"'"),""))>0,"abc","")
=IF(LEN(iferror(query(def!$A$2:$A,"select A where A='" & $A2 &"'"),""))>0,"def","")
=IF(LEN(iferror(query(mno!$A$2:$A,"select A where A='" & $A2 &"'"),""))>0,"mno","")
=IF(LEN(iferror(query(xyz!$A$2:$A,"select A where A='" & $A2 &"'"),""))>0,"xyz","")
Probably I could have simplified a little by putting the sheet names in E1:H1, but you get the idea.
Each of these looks for the ID. If the query succeeds, it returns the name of the sheet. If it fails, it returns the empty string.
Now in column B where I actually want the results, I put this formula in B2 and drag to copy as needed.
=if(E2&F2&G2&H2="","nowhere",E2&F2&G2&H2)
It says put those strings together, and if there is nothing there say nowhere, otherwise say the list. If it appears on more than one, and that can really happen, you could use JOIN instead.

Regex (re2 googlesheets) multiple values in multiline cell

Getting stuck on how to read and pretty up these values from a multiline cell via arrayformula.
Im using regex as preceding line can vary.
just formulas please, no custom code
The first column looks like a set of these:
```
[config]
name = the_name
texture = blah.dds
cost = 1000
[effect0]
value = 1000
type = ATTR_A
[effect1]
value = 8
type = ATTR_B
[feature0]
name = feature_blah
[components]
0 = comp_one,1
[resources]
res_one = 1
res_five = 1
res_four = 1
<br/>
Where to be useful elsewhere, at minimum it needs each [tag] set ([effect\d], [feature\d], ect) to be in one column each, for example the 'effects' column would look like:
ATTR_A:1000,ATTR_B:8
and so on.
Desired output can also be seen in the included spreadsheet
<br/>
<b>Here is the example spreadsheet:</b>
https://docs.google.com/spreadsheets/d/1arMaaT56S_STTvRr2OxCINTyF-VvZ95Pm3mljju8Cxw/edit?usp=sharing
**Current REGEXREPLACE**
Kinda works, finds each 'type' and 'value' great, just cant figure out how to extract just that from the rest, tried capture (and non-capturing) groups before and after but didnt work
=ARRAYFORMULA(REGEXREPLACE($A3:$A,"[\n.][effect\d][\n.](.)\n(.)","1:$1 2:$2"))
**Current SUBSTITUTE + REGEXEXTRACT + REGEXREPLACE**
A different approach entirely, also kinda works, longer form though and left with having to parse the values out of that string, where got stuck again. Idea was to use this to simplify, then regexreplace like above. Getting stuck removing content around the final matches though, and if can do that then above approach is fine too.
// First ran a substitute
=ARRAYFORMULA(SUBSTITUTE(SUBSTITUTE($A3:$A,char(10),";"),";;",char(10)))
// Then variation of this (gave up on single line 'effect/d' so broke it up to try and get it working)
=ARRAYFORMULA(IF(A3:A<>"",IFERROR(REGEXEXTRACT(A3:A,"(?m)^(?:[effect0]);(.)$")&";;")&""&IFERROR(REGEXEXTRACT(A3:A,"(?m)^(?:[effect1]);(.)$")&";;")&""&IFERROR(REGEXEXTRACT(A3:A,"(?m)^(?:[effect2]);(.)$")&";;"),""))
// Then use regexreplace like above
=ARRAYFORMULA(REGEXREPLACE($B3:$B,"value = (.);type = (.);;","1:$1 2:$2"))
**--EDIT--**
Also, as my updated 'Desired Output' sheet shows (see timestamped comment below), bonus kudos if you can also extract just the values of matching 'type's to those extra columns (see spreadsheet).
All good if you cant though, just realized would need that too for lookups.
**--END OF EDIT--**
<br/>
Ive tried dozens of things, discarding each in turn, had a quick look in version history to grab out two promising attempts and shared them in separate sheets.
One of these also used SUBSTITUTE to simplify input column, im happy for a solution using either RAW or the SUBSTITUTE results.
<br/>
**Potentially Useful links:**
https://github.com/google/re2/wiki/Syntax
<br/>
<b>Just some more words:</b>
I also have looked at dozens of stackoverflow and google support pages, so tried both REGEXEXTRACT and REGEXREPLACE, both promising but missing that final tweak. And i tried dozens of tweaks already on both.
Any help would be great, and hopefully help others in future since examples with spreadsheets are great since every new REGEX seems to be a new adventure ;)
<br/>
P.S. if we can think of better title for OP, please say in comment or your answer :)
paste in B3:
=ARRAYFORMULA(SUBSTITUTE(TRIM(TRANSPOSE(QUERY(TRANSPOSE(
IF(C3:E<>"", C2:E2&":"&C3:E, )),,999^99))), " ", ", "))
paste in C3:
=ARRAYFORMULA(IFNA(REGEXEXTRACT($A3:$A, "(\d+)\ntype = "&C2)))
paste in D3:
=ARRAYFORMULA(IFNA(REGEXEXTRACT($A3:$A, "(\d+)\ntype = "&D2)))
paste in E3:
=ARRAYFORMULA(IFNA(REGEXEXTRACT($A3:$A, "(\d+)\ntype = "&E2)))
paste in F3:
=ARRAYFORMULA(IFNA(REGEXEXTRACT(A3:A, "\[feature\d+\]\nname = (.*)")))
paste in G3:
=ARRAYFORMULA(IFNA(REGEXEXTRACT(A3:A, "\[components\]\n\d+ = (.*)")))
paste in H3:
=ARRAYFORMULA(IFNA(REGEXREPLACE(INDEX(SPLIT(REGEXEXTRACT(
REGEXREPLACE(A3:A, "\n", ", "), "\[resources\], (.*)"), "["),,1), ", , $", )))
spreadsheet demo
This was a fun exercise. :-)
Caveat first: I have added some "input data". Examples:
[feature1]
name = feature_active_spoiler2
[components]
0 = spoiler,1
1 = spoilerA, 2
So the output has "extra" output.
See the tab ADW's Solution.

giving a string variable values conditional on another variable

I am using Stata 14. I have US states and corresponding regions as integer.
I want create a string variable that represents the region for each observation.
Currently my code is
gen div_name = "A"
replace div_name = "New England" if div_no == 1
replace div_name = "Middle Atlantic" if div_no == 2
.
.
replace div_name = "Pacific" if div_no == 9
..so it is a really long code.
I was wondering if there is a shorter way to do this where I can automate assigning values rather than manually hard coding them.
You can define value labels in one line with label define and then use decode to create the string variable. See the help for those commands.
If the correspondence was defined in a separate dataset you could use merge. See e.g. this FAQ
There can't be a short-cut here other than typing all the names at some point or exploiting the fact that someone else typed them earlier into a file.
With nine or so labels, typing them yourself is quickest.
Note that you type one statement more than you need, even doing it the long way, as you could start
gen div_name = "New England" if div_no == 1

Using Regex in Pig in hadoop

I have a CSV file containing user (tweetid, tweets, userid).
396124436476092416,"Think about the life you livin but don't think so hard it hurts Life is truly a gift, but at the same it is a curse",Obey_Jony09
396124436740317184,"“#BleacherReport: Halloween has given us this amazing Derrick Rose photo (via #amandakaschube, #ScottStrazzante) http://t.co/tM0wEugZR1” yes",Colten_stamkos
396124436845178880,"When's 12.4k gonna roll around",Matty_T_03
Now I need to write a Pig Query that returns all the tweets that include the word 'favorite', ordered by tweet id.
For this I have the following code:
A = load '/user/pig/tweets' as (line);
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT_ALL(line,'(.*)[,”:-](.*)[“,:-](.*)')) AS (tweetid:long,msg:chararray,userid:chararray);
C = filter B by msg matches '.*favorite.*';
D = order C by tweetid;
How does the regular expression work here in splitting the output in desired way?
I tried using REGEX_EXTRACT instead of REGEX_EXTRACT_ALL as I find that much more simpler, but couldn't get the code working except for extracting just the tweets:
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT(line,'[,”:-](.*)[“,:-]',1)) AS (msg:chararray);
the above alias gets me the tweets, but if I use REGEX_EXTRACT to get the tweet_id, I do not get the desired o/p: B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT(line,'(.*)[,”:-]',1)) AS (tweetid:long);
(396124554353197056,"Just saw #samantha0wen and #DakotaFears at the drake concert #waddup")
(396124554172432384,"#Yutika_Diwadkar I'm just so bright 😁")
(396124554609033216,"#TB23GMODE i don't know, i'm just saying, why you in GA though? that's where you from?")
(396124554805776385,"#MichaelThe_Lion me too 😒")
(396124552540852226,"Happy Halloween from us 2 #maddow & #Rev_AlSharpton :) http://t.co/uC35lDFQYn")
grunt>
Please help.
Can't comment, but from looking at this and testing it out, it looks like your quotes in the regex are different from those in the csv.
" in the csv
” in the regex code.
To get the tweetid try this:
B = FOREACH A GENERATE FLATTEN(REGEX_EXTRACT(line,'.*(,")',1)) AS (tweetid:long);

How to transfer covariance-like table into one-to-one pairs cleverly in stata?

I encounter a practical issue on data management using Stata. What I'm planning to do is creating variable of spherical distances between 30 province capitals (so there are roughly 870 identical values) of China. There have been some user-written commands to handle this issue(through google map) but My problem is, for some confidential reason, the data is stored in a isolated computer that disconnected to internet,so I have to defined all of the one-to-one distance value in do-file and then merge them into the data. Given the nontrival workload (though not really infeasible), I wonder if there is some clever way to do the job. I have an excel worksheet in which the distance is like a covariance martrix with province capitals' names appearing in both first row and column,it's like a lower-triangle matrix, . stands for values
A B C D E AD
capital_1 capital_2 capital_3 capital_4 ······ capital_30
1 capital_1
2 capital_2 '
3 capital_3 ' '
4 capital_4 ' ' '
········
capital_30 ' ' ' '
I know how to import such martrix, but can I generate the desired one-to-one pairs? Thank you.
Thanks for #Roberto 's helpful suggestion, I have partly solved the problem, and I pose my solution to benefit any newcomer who may meet similar problem, and want to benefit from the real experts to improve my code
//import data from excel file
import excel using distance.xls
*********************************
* rename the variable name
* (to make it more well orgnized to facilitie use of --reshape-- command
*********************************
* rename the first column
ren A id
* rename the following variables
local i=0
foreach var of varlist B-AF {
local j=`i'+1
local i=`i'+1
ren `var' distance`j'
}
*********************************
* reshape the data
*********************************
reshape long distance,i(id) j(city)
tostring city,replace
***the following part is really urgly, because I don't know how
*to gen a one-to-one mapping between orginal province name and
*the new generated variable name like distance`j',I have to recover
*them by hand, I hope someone can help me to improve this part**
replace city="北京" if city=="1"
replace city="天津" if city=="2"
replace city="河北" if city=="3"
replace city="山西" if city=="4"
replace city="内蒙古" if city=="5"
replace city="辽宁" if city=="6"
replace city="吉林" if city=="7"
·····