Stata regular expressions: extracting numerical values

I have some data that looks like this
var1
h 01 .00 .0 abc
d 1.0 .0 14.0abc
1,0.0 0.0 .0abc
Note that the last three alphabetic characters are the same in every row, and I am hoping to extract all the numerical values within the string. The code that I'm using looks like this:
gen x1=regexs(1) if regexm(var1,"([0-9]+) [ ]*(abc)*$")
However, this code only extracts the number immediately before the abc term and stops at a space or a period (.). For example, only the 0 before abc is extracted from the first row. Is there a way to handle this and extract all the numerical values before the alphabetic characters?

As @Roberto Ferrer points out, your question isn't very clear, but here is an example using moss from SSC:
. clear
. input str16 var1
var1
1. "h 01 .00 .0 abc"
2. "d 1.0 .0 14.0abc"
3. "1,0.0 0.0 .0abc"
4. end
. moss var1, regex match("([0-9]+\.*[0-9]*|\.[0-9]+)")
. l _match*
     +---------------------------------------+
     | _match1   _match2   _match3   _match4 |
     |---------------------------------------|
  1. |      01       .00        .0           |
  2. |     1.0        .0      14.0           |
  3. |       1       0.0       0.0        .0 |
     +---------------------------------------+
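For readers who want to check the pattern outside Stata, the same regular expression can be tried with Python's re module. This is only a sketch: the helper name extract_numbers is made up for illustration, and moss itself is not involved.

```python
import re

# Same pattern as in the moss call: digits with an optional fractional part,
# or a leading period followed by digits.
NUMBER = re.compile(r"[0-9]+\.*[0-9]*|\.[0-9]+")

def extract_numbers(text):
    """Return every numeric token found in text, in order (hypothetical helper)."""
    return NUMBER.findall(text)

rows = ["h 01 .00 .0 abc", "d 1.0 .0 14.0abc", "1,0.0 0.0 .0abc"]
for row in rows:
    print(extract_numbers(row))
```

The comma in the third row is not part of the pattern, so "1,0.0" is split into "1" and "0.0", just as in the moss listing.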

Related

Matching values in a variable by year

I have the following minimal example:
input str5 name year match1 match2 match3
Alice 2000 . . .
Alice 2000 . . .
Bob 2000 . . .
Carol 2001 0 . .
Alice 2002 0 1 .
Carol 2002 1 0 .
Bob 2003 0 0 1
Bob 2003 0 0 1
end
I have data on name and year, and I want to create binary variables match1, match2 and match3, where matchj equals 1 if the name appears in the data j years before the current year. For example, for the first observation, match1 equals 1 if Alice appears in 1999, match2 equals 1 if Alice appears in 1998, and so on.
If there is no year prior to that year (in this case there is no 1999 or 1998), the binary variable will be missing.
How can I construct these match variables? Note that I have millions of unique names, so levelsof name, local(match) fails with a "macro substitution results in line that is too long" error. Also note that a name is sometimes duplicated within a given year, and some names may be missing in a given year.
Thanks for the data example. Here is one technique using rangestat from SSC. I don't understand your rule on which values should be 0 and which missing.
* Example generated by -dataex-. For more info, type help dataex
clear
input str5 name float year
"Alice" 2000
"Alice" 2000
"Alice" 2002
"Bob" 2000
"Bob" 2003
"Bob" 2003
"Carol" 2001
"Carol" 2002
end
gen one = 1
forval j = 1/3 {
rangestat (max) match`j'=one, int(year -`j' -`j') by(name)
}
drop one
sort name year
list, sepby(year)
+-----------------------------------------+
| name year match1 match2 match3 |
|-----------------------------------------|
1. | Alice 2000 . . . |
2. | Alice 2000 . . . |
|-----------------------------------------|
3. | Alice 2002 . 1 . |
|-----------------------------------------|
4. | Bob 2000 . . . |
|-----------------------------------------|
5. | Bob 2003 . . 1 |
6. | Bob 2003 . . 1 |
|-----------------------------------------|
7. | Carol 2001 . . . |
|-----------------------------------------|
8. | Carol 2002 1 . . |
+-----------------------------------------+
As the original author of levelsof I find it a little melancholy to see it pressed into service where it is of little or no help.
Here is an alternative approach using frames:
keep name year
frame copy default prev
frame prev: duplicates drop
frame prev: rename year myear
gen myear=.
forvalues i=1/3 {
replace myear = year-`i'
frlink m:1 name myear, frame(prev) generate(match`i')
replace match`i' = 1 if match`i'!=.
}
drop myear
Output:
name year match1 match2 match3
1. Alice 2000 . . .
2. Alice 2000 . . .
3. Bob 2000 . . .
4. Carol 2001 . . .
5. Alice 2002 . 1 .
6. Carol 2002 1 . .
7. Bob 2003 . . 1
8. Bob 2003 . . 1
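The lookup that both answers perform (does the same name occur exactly j years earlier?) can be sketched in Python with a set of (name, year) pairs. This is an illustration only: match_flags is a hypothetical helper, None stands in for Stata's missing value, and, like the answers above, it never produces 0.

```python
def match_flags(records, max_lag=3):
    """For each (name, year) record, return a list with one entry per lag:
    1 if the same name occurs exactly that many years earlier, else None."""
    seen = set(records)  # duplicates collapse, so repeated rows are harmless
    flags = []
    for name, year in records:
        row = [1 if (name, year - lag) in seen else None
               for lag in range(1, max_lag + 1)]
        flags.append(row)
    return flags

records = [
    ("Alice", 2000), ("Alice", 2000), ("Bob", 2000), ("Carol", 2001),
    ("Alice", 2002), ("Carol", 2002), ("Bob", 2003), ("Bob", 2003),
]
for record, row in zip(records, match_flags(records)):
    print(record, row)
```

Because membership tests on a set take roughly constant time, this avoids building a long list of names, which is what made the levelsof approach fail.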

Remove "." from digits

I have a string like this:
"lmn abc 4.0mg 3.50 mg over 12 days. Standing nebs."
I want to convert it into:
"lmn abc 40mg 350 mg over 12 days. Standing nebs."
That is, I only want to convert a.b to ab where a and b are integers.
waiting for help
Assuming you are using Python, you can use capture groups in the regex, either numbered or named, and then reference the groups in the replacement while leaving out the period (.).
import re
text = "lmn abc 4.0mg 3.50 mg over 12 days. Standing nebs."
Numbered: you reference each pattern group (the content in parentheses) by its index. Raw strings keep the backslashes intact.
text = re.sub(r"(\d+)\.(\d+)", r"\1\2", text)
Named: you reference each pattern group by a name you specify.
text = re.sub(r"(?P<before>\d+)\.(?P<after>\d+)", r"\g<before>\g<after>", text)
Each version returns:
print(text)
> lmn abc 40mg 350 mg over 12 days. Standing nebs.
However, be aware that removing the . from decimal numbers changes their value (4.0 becomes 40), so be careful with whatever you do with these numbers afterwards.
Using any sed in any shell on every Unix box:
$ sed 's/\([0-9]\)\.\([0-9]\)/\1\2/g' file
"lmn abc 40mg 350 mg over 12 days. Standing nebs."
Using sed
$ cat input_file
"lmn abc 4.0mg 3.50 mg over 12 days. Standing nebs. a.b.c."
$ sed 's/\([a-z0-9]*\)\.\([a-z0-9]\)/\1\2/g' input_file
"lmn abc 40mg 350 mg over 12 days. Standing nebs. abc."
echo '1.2 1.23 12.34 1. .2' |
ruby -p -e '$_.gsub!(/\d+\K\.(?=\d+)/, "")'
Output
12 123 1234 1. .2
If performance matters:
echo '1.2 1.23 12.34 1. .2' |
ruby -p -e 'BEGIN{$regex = /\d+\K\.(?=\d+)/; $empty_string = ""}; $_.gsub!($regex, $empty_string)'
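For comparison with the Ruby pattern above, a similar lookaround idea can be written in Python. This is a sketch, not a drop-in equivalent: Python's re module has no \K, so a lookbehind is used instead, and strip_inner_dots is a hypothetical helper name.

```python
import re

# A period is removed only when a digit sits directly on both sides of it.
DOT_BETWEEN_DIGITS = re.compile(r"(?<=\d)\.(?=\d)")

def strip_inner_dots(text):
    """Delete every period that is surrounded by digits (hypothetical helper)."""
    return DOT_BETWEEN_DIGITS.sub("", text)

print(strip_inner_dots("1.2 1.23 12.34 1. .2"))
print(strip_inner_dots("lmn abc 4.0mg 3.50 mg over 12 days. Standing nebs."))
```

Because the lookarounds consume no characters, a chain such as 1.2.3 loses both periods, whereas the capture-group version consumes the digits and would leave 12.3.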

destring 18 digit number returns rounding error

I have IDs with a length of 18 as strings and I want to transform them to a numeric variable.
I tried the following code:
destring var, replace
That returns a numeric variable. However, the last digit of the ID shows a rounding error, e.g. 123456789123456000 --> 123456789123456001
How can I destring my values without any change in the ID?
I can't reproduce that as this shows:
. clear
. set obs 1
number of observations (_N) was 0, now 1
. gen test = "123456789123456000"
. destring, gen(check)
test: all characters numeric; check generated as double
. l
+--------------------------------+
| test check |
|--------------------------------|
1. | 123456789123456000 1.235e+17 |
+--------------------------------+
. format check %23.0f
. l
+-----------------------------------------+
| test check |
|-----------------------------------------|
1. | 123456789123456000 123456789123456000 |
+-----------------------------------------+
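The other way round does produce an error, as the string "123456789123456001" maps to 123456789123456000. In essence, you are bumping up against what can be held exactly in a double, which uses 8 bytes for each number: integers are guaranteed to be exact only up to 2^53 = 9007199254740992, which has 16 digits.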
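The same limit can be seen outside Stata, since a Python float is also an 8-byte double. The following is a small sketch; survives_double is a hypothetical helper introduced only for this illustration.

```python
# Largest power of two up to which every integer is exactly representable.
LIMIT = 2 ** 53

def survives_double(id_string):
    """True if the integer in id_string is unchanged after a round trip
    through an 8-byte floating-point number (hypothetical helper)."""
    return int(float(id_string)) == int(id_string)

print(LIMIT)
print(survives_double("123456789123456000"))
print(survives_double("123456789123456001"))
print(int(float("123456789123456001")))
```

Near 1.2e17 neighbouring doubles are 16 apart, so only some 18-digit IDs survive the conversion; the first ID above happens to be one of them.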

Group together records that share either one of two variables

I am working with Stata and I have a large data set where I need to group records together if they share one of two variables.
For example, take the following three observations:
Observation | matching var1 | matching var2
1           | xxx           | aaa
2           | xxx           | bbb
3           | yyy           | bbb
If I were to group the records by var1, the first two observations would be in the same group and the last observation would be in a separate group. Similarly, if I were to group using var2, observations two and three would be in the same group and observation one would be in a separate group. However, if I were to group the records based on a match on either var1 or var2, all observations would be in the same group.
I would like to create a 'group id' variable that will take the same value across all these records.
Any suggestions on how I should go about it?
The community-contributed command group_twoway (available from SSC) can group records that match on either of two variables:
ssc install group_twoway
Using an extended version of your example:
clear
input str3(var1 var2)
"xxx" "aaa"
"yyy" "bbb"
"mmm" "ccc"
"nnn" "ccc"
"mmm" "ddd"
"ooo" "ff"
"pp" "eee"
"qq" "ff"
"rr" "u"
"xxx" "bbb"
end
group_twoway var1 var2, generate(group_id)
Result # of obs.
-----------------------------------------
not matched 0
matched 10
-----------------------------------------
list, sepby(group_id) constant
+------------------------+
| var1 var2 group_id |
|------------------------|
1. | xxx aaa 1 |
2. | yyy bbb 1 |
|------------------------|
3. | mmm ccc 2 |
4. | nnn ccc 2 |
5. | mmm ddd 2 |
|------------------------|
6. | ooo ff 3 |
|------------------------|
7. | pp eee 4 |
|------------------------|
8. | qq ff 3 |
|------------------------|
9. | rr u 5 |
|------------------------|
10. | xxx bbb 1 |
+------------------------+
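The grouping rule itself (records belong together if they are linked through a shared var1 or var2 value, possibly via a chain of other records) can be sketched with a union-find structure. This is an illustration in Python under that assumption, not a description of how group_twoway is implemented; group_by_either is a hypothetical name.

```python
def group_by_either(rows):
    """Assign a group id to each (var1, var2) row so that rows sharing a var1
    or a var2 value, directly or through a chain, get the same id."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Tag values with their column so equal strings in var1 and var2 stay distinct.
    for v1, v2 in rows:
        union(("var1", v1), ("var2", v2))

    ids = {}
    result = []
    for v1, _ in rows:
        root = find(("var1", v1))
        if root not in ids:
            ids[root] = len(ids) + 1  # number groups by first appearance
        result.append(ids[root])
    return result

rows = [("xxx", "aaa"), ("yyy", "bbb"), ("mmm", "ccc"), ("nnn", "ccc"),
        ("mmm", "ddd"), ("ooo", "ff"), ("pp", "eee"), ("qq", "ff"),
        ("rr", "u"), ("xxx", "bbb")]
print(group_by_either(rows))
```

On the ten example rows this yields the same partition as the listing above, with ids numbered in order of first appearance.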

Regex to extract (german) street number

I have the following street constellations:
| Street name | extracted value |
| --------------------------------------- | --------------- |
| Lilienstr. 12a | 12a |
| Hagentorwall 3 | 3 |
| Seilerstr. 14 (Eingang Birkenstr.) | 14 |
| Guentherstr. 43 B | 43 B |
| Eberhard-Leibnitz Str. 1 WH 5B 241 | 1 |
| 1019-1781 Borderlinx C/O SEKO Logistics | - |
My Regex is partially working (https://regex101.com/r/KumamP/2):
\d+(?:[a-zA-Z]$|\s[a-zA-Z]$)?
Does anyone have a better solution? Eberhard-Leibnitz Str. should give me only one result or none. 1019-1781 Borderlinx C/O SEKO Logistics should give no result.
The following regex is working for your example
^[ \-a-zA-Z.]+\s+(\d+(\s?\w$)?)
https://regex101.com/r/KumamP/4
The basic assumption is (like your samples suggest), that valid "street constellations" always start with a street name followed by the street/house number.
The next regex is also working if there is an entry like Straße des 17. Juni 1:
^[ \-0-9a-zA-ZäöüÄÖÜß.]+?\s+(\d+(\s?[a-zA-Z])?)\s*(?:$|\(|[A-Z]{2})
https://regex101.com/r/KumamP/5
But as the commenters already wrote, it is difficult for a regular expression to distinguish between numerical parts of a street name and the street number, even more so if you allow "unspecified" suffixes like (Eingang Birkenstr.) or WH 5B 241 as in your example.
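A quick way to check the second pattern against the sample addresses is a small Python harness. This is a sketch that assumes Python's re module behaves like the regex101 flavour for this pattern; street_number is a hypothetical helper, and only capture group 1 is reported.

```python
import re

STREET_NUMBER = re.compile(
    r"^[ \-0-9a-zA-ZäöüÄÖÜß.]+?\s+(\d+(\s?[a-zA-Z])?)\s*(?:$|\(|[A-Z]{2})"
)

def street_number(line):
    """Return the captured house number, or None when the pattern does not match."""
    match = STREET_NUMBER.match(line)
    return match.group(1) if match else None

samples = [
    "Lilienstr. 12a",
    "Hagentorwall 3",
    "Seilerstr. 14 (Eingang Birkenstr.)",
    "Guentherstr. 43 B",
    "Eberhard-Leibnitz Str. 1 WH 5B 241",
    "1019-1781 Borderlinx C/O SEKO Logistics",
    "Straße des 17. Juni 1",
]
for sample in samples:
    print(repr(sample), "->", street_number(sample))
```

The C/O line fails because "/" is not in the allowed character class for the street name, so the pattern never reaches a number it could accept.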
Parsing address lines is not trivial. Many countries have their own special rules and Germany and Austria are really tricky.
To better understand the examples you provided, one in particular illustrates the point:
"Eberhard-Leibnitz Str. 1 WH 5B 241"
The "WH" here stands for "Wohnung", but they usually use just "W" (and use some separator like "//"). So it would be more like:
"Eberhard-Leibnitz Str. 1 // W 5B 241"
It's also common to find "co" or "c/o" or "z. H" (abbreviation for "zu Händen von"). Anything that follows it is just the name on the mailbox.
And last but not least, the address line could also contain the zip code and city name. It depends on the API you're interacting with, or on whether it's user input (it can get very wild then!).
So, to properly parse address lines, you should first normalize them, by removing that extra information. Then you can use a regex. Take a look at this gem: https://github.com/matiasalbarello/address_line_divider
Some good reads about the topic:
https://www.german-way.com/germans-we-dont-need-apartment-numbers/
https://allaboutberlin.com/guides/addressing-a-letter-in-germany
http://interactive.zeit.de/strassennamen/