How to extract specific information from strings

How to extract specific information from strings - stata

I have a dataset with the addresses of authors' affiliations. Addresses have differing length. But the information before the first comma is the name of he institution and that after the last comma the country. What I want to do is to extract the country and create a new variable for it.
I tried this code in Stata. It works to extract the name of institutions.
generate splitat = strpos(institutions ,",")
generate str80 univ = substr(institutions, 1, splitat - 1)
I am wondering whether this code also can be applied to extract the country.
I thought it could check from the end instead from the start?
My dataset looks like the following example:
Natl Taiwan Univ, Inst Epidemiol, Taipei 106, Taiwan
Radboud Univ Nijmegen, Inst Water & Wetland Res, Dept Anim Ecol & Ecophysiol, NL-6525 AJ Nijmegen, Netherlands

There is a specific function in Stata 14+ to look for the last occurrence of a substring (e.g. a specific character) in a string. See help string functions in Stata 14 for documentation of strrpos().
If that is not in your version of Stata, you merely reverse the string, find the substring using the method you already know, and then reverse what you found.
If you are not using the latest version of Stata, it is always a good idea to specify that in questions in any forum that supports Stata questions,
clear
input str244 institutions
"Natl Taiwan Univ, Inst Epidemiol, Taipei 106, Taiwan"
"Radboud Univ Nijmegen, Inst Water & Wetland Res, Dept Anim Ecol & Ecophysiol, NL-6525 AJ Nijmegen, Netherlands"
end
compress
gen country = substr(institutions, strrpos(institutions, ",") + 1, .)
local rev strreverse(institutions)
gen country2 = strreverse(substr(`rev', 1, strpos(`rev', ",") - 1))
assert country == country2
l country
+--------------+
| country |
|--------------|
1. | Taiwan |
2. | Netherlands |
+--------------+

Related

Using dictionary in regexp_replace function in pyspark

I want to perform an regexp_replace operation on a pyspark dataframe column using dictionary.
Dictionary : {'RD':'ROAD','DR':'DRIVE','AVE':'AVENUE',....}
The dictionary will have around 270 key value pair.
Input Dataframe:
ID | Address
1 | 22, COLLINS RD
2 | 11, HEMINGWAY DR
3 | AVIATOR BUILDING
4 | 33, PARK AVE MULLOHAND DR
Desired Output Dataframe:
ID | Address | Address_Clean
1 | 22, COLLINS RD | 22, COLLINS ROAD
2 | 11, HEMINGWAY DR | 11, HEMINGWAY DRIVE
3 | AVIATOR BUILDING | AVIATOR BUILDING
4 | 33, PARK AVE MULLOHAND DR | 33, PARK AVENUE MULLOHAND DRIVE
I cannot find any documentation on internet. And if trying to pass dictionary as below codes-
data=data.withColumn('Address_Clean',regexp_replace('Address',dict))
Throws an error "regexp_replace takes 3 arguments, 2 given".
Dataset will be around 20 million in size. Hence, UDF solution will be slow (due to row wise operation) and we don't have access to spark 2.3.0 which supports pandas_udf.
Is there any efficient method of doing it other than may be using a loop?

It is trowing you this error because regexp_replace() needs three arguments:
regexp_replace('column_to_change','pattern_to_be_changed','new_pattern')
But you are right, you don't need a UDF or a loop here. You just need some more regexp and a directory table that looks exactly like your original directory :)
Here is my solution for this:
# You need to get rid of all the things you want to replace.
# You can use the OR (|) operator for that.
# You could probably automate that and pass it a string that looks like that instead but I will leave that for you to decide.
input_df = input_df.withColumn('start_address', sf.regexp_replace("original_address","RD|DR|etc...",""))
# You will still need the old ends in a separate column
# This way you have something to join on your directory table.
input_df = input_df.withColumn('end_of_address',sf.regexp_extract('original_address',"(.*) (.*)", 2))
# Now we join the directory table that has two columns - ends you want to replace and ends you want to have instead.
input_df = directory_df.join(input_df,'end_of_address')
# And now you just need to concatenate the address with the correct ending.
input_df = input_df.withColumn('address_clean',sf.concat('start_address','correct_end'))

How to know if a variation (f.e. abbreviation) of a string in a list does match agains another list if the original does not?

I currently searching for a method in R which let's me match/merge two data frames. Helas both of these data frames contain non optimal data. They can have certain abbreviations of even typo's in them. Therefore I would like to define a list for each abbreviation and if a string contains one of those elements. If the original entries don't match, R should check if any of the other options of the abbreviation has a match. To illustrate: the name of a company could end with "Limited" but also with "Ltd." of "Ltd" etc.
EXAMPLE
Data
The Original "Address" file contains:
Company name Address
Deloitte Ltd. New York
Coca-Cola New York
Tesla ltd California
Microsoft Limited Washington
Would have to be merged with "EnterpriseNrList"
Company name EnterpriseNumber
Deloitte Ltd. 221
Coca-Cola 334
Tesla ltd 725
Microsoft Limited 127
So the abbreviations should work in "both directions". That's why I said, if R recognises any of the abbreviations, R should try to match all of them.
All of the matches should be reported as the return.
Therefore I would make up a list "Abbreviations" for each possible abbreviation
Limited.
limited
Ltd.
ltd.
Ltd
ltd
Questions
1) Would this be a good method, or would there be a more efficient way?
2) How can I check a list against a list of possible abbreviations (step 1, see below), sort of a containsx from excel?
3) How could I make up a list that replaces for the entries that do not match the abbreviation with all other abbreviatinos (step 2, see below)?
Thoughts for solution
Step 1
As I am still very new to this kind of work, I was thinking the following: use a regex expression to filter out wether a string contains any of the abbreviation options and create a list which will then contain either -1 if no match could be found and >0 if match is found. The no pattern matching can already be matched against the "Address" list. With the other entries I continue to step 2.
In this step I don't really know how to check against a list of options ("Abbreviations" list).
Step 2
Next I would create a list with the matches from step 1 and rbind together all options. In this step I don't really know to I could create a list that combines f.e. Coca-Cola with all it's possible abbreviations.
Coca-Cola Limited
Coca-Cola Ltd.
Coca-Cola Ltd
etc.
Step 3
Lastly I would match/merge this more complete list of companies again with the original "Data" list. With the introduction of step 2 I thought It might be a bit easier on the required computing power, as the original list is about 8000 rows.

I would go in a different approach, fixing the tables first before the merge.
To fix with abreviations, I would use a regex, case insensitive, the final dot being optionnal, I start with a list of 'Normal word' = vector of abbreviations.
abbrevs <- list('Limited'=c('Limited','Ltd'),'Incorporated'=c('Incorporated','Inc'))
The I build the corresponding regex (alternations with an optional dot at end, the case will be ignored by parameter in gsub and agrep later):
regexes <- lapply(abbrevs,function(x) { paste0("(",paste0(x,collapse='|'),")[.]?") })
Which gives:
$Limited
[1] "(Limited|Ltd)[.]?"
$Incorporated
[1] "(Incorporated|Inc)[.]?"
Now we have to apply each regex to the company.name column of each df:
for (i in seq_along(regexes)) {
Address$Company.name <- gsub(regexes[[i]], names(regexes[i]), Address$Company.name, ignore.case=TRUE)
Enterprise$Company.name <- gsub(regexes[[i]], names(regexes[i]), Enterprise$Company.name, ignore.case=TRUE)
}
This does not take into account typos. Here you'll need to work on with agrepor adist to manage it.
Result for Address example data set:
> Address
Company.name Address
1 Deloitte Limited New York
2 Coca-Cola New York
3 Tesla Limited California
4 Microsoft Limited Washington
Input data used:
Address <- structure(list(Company.name = c("Deloitte Ltd.", "Coca-Cola",
"Tesla ltd", "Microsoft Limited"), Address = c("New York", "New York",
"California", "Washington")), .Names = c("Company.name", "Address"
), class = "data.frame", row.names = c(NA, -4L))
Enterprise <- structure(list(Company.name = c("Deloitte Ltd.", "Coca-Cola",
"Tesla ltd", "Microsoft Limited"), EnterpriseNumber = c(221L,
334L, 725L, 127L)), .Names = c("Company.name", "EnterpriseNumber"
), class = "data.frame", row.names = c(NA, -4L))

I would say that the answer depends on whether you have a list of abbreviations or not.
If you have one, you could just look which element of your list contains an abbreviation with grep or greplfunctions. (grep return all indexes that have a matching pattern whereas grepl returns a logical vector).
Also, use the ignore.case= TRUE parameter of these function, so you don't have to try all capitalized/lowercase possibilities.
If you don't have such a list, my first guest would be to extract the first "word" of each company (I would guess that there is a single "Deloitte" company, and that it is "Deloitte Ltd"). You can do so with:
unlist(strsplit(CompanyNames,split = " "))
If you wanted to also correct for typos, this is more a question of string distance.
Hope that it helped!

R - grab the exact 8 digits number in a string and transform it

I have 2 problems in extracting and transforming data using R.
Here's the dataset:
messageID | msg
1111111111 | hey id 18271801, fix it asap
2222222222 | please fix it soon id12901991 and 91222911. dissapointed
3333333333 | wow $300 expensive man, come on
4444444444 | number 2837169119 test
The problem is:
I want to grab the number with only 8 digits length. In the dataset above, message id 3333...(300 - 3 digits) and 4444...(2837169119 - 10 digits) should not included. And here's my best shot so far:
as.matrix(unlist(apply(df[2],1,function(x){regmatches(x,gregexpr('([0-9]){8}', x))})))
.
However, with this line of code, message 444... is included because is contains more than 8 digits number.
Transform the data to another form like this:
message_id | customer_ID
1111111111 | 18271801
2222222222 | 12901991
2222222222 | 91222911
I don't know how to efficiently transform the data.
The output of dput(df):
structure(list(id = c(1111111111, 2222222222, 3333333333, 4444444444
), msg = c("hey id 18271801, fix it asap", "please fix it soon id12901991 and 91222911. dissapointed",
"wow $300 expensive man, come on", "number 2837169119 test")), .Names = c("id",
"msg"), row.names = c(NA, 4L), class = "data.frame")
Thanks

Use rebus to create your regular expression, and stringr to extract the matches.
You may need to play with the exact form of the regular expression. This code works on your examples, but you'll probably need to adapt it for your dataset.
library(rebus)
library(stringr)
# Create regex
rx <- negative_lookbehind(DGT) %R%
dgt(8) %R%
negative_lookahead(DGT)
rx
## <regex> (?<!\d)[\d]{8}(?!\d)
# Extract the IDs
extracted_ids <- str_extract_all(df$msg, perl(rx))
# Stuff the IDs into a data frame.
data.frame(
messageID = rep(
df$id,
vapply(extracted_ids, length, integer(1))
),
extractedID = unlist(extracted_ids, use.names = FALSE)
)

Stata - inputting data from .txt with "" and ,

I am using perl to scrape the following through .txt which I'd ultimately bring into Stata. What format option works? I have many such observations, so would like to use an approach over which I can generalize.
The original data are of the form:
First Name: Allen
Last Name: Von Schmidt
Birth Year: 1965
Location: District 1, Ocean City, Cape May, New Jersey, USA
First Name: Lee Roy
Last Name: McBride
Birth Year: 1967
Location: Precinct 5, District 2, Chicago, Cook, Illinois, USA
The goal is to create the variables in Stata:
First Name: Allen
Last Name: Von Schmidt
Birth Year: 1965
County: Cape May
State: New Jersey
First Name: Allen
Last Name: McBride
Birth Year: 1967
County: Cook
State: Illinois
What possible .txt might lead to such, and how would I load it into Stata?
Also, the amount of terms vary in Location as in these 2 examples, but I always want the 2 before USA.
At the moment, I am putting "", around each variable from the table for the .txt.
"Allen","Von Schmidt","1965","District 1, Ocean City, Cape May, New Jersey, USA"
"Lee Roy","McBride","1967","Precinct 5, District 2, Chicago, Cook, Illinois, USA"
Is there a better way to format the .txt? How would I create the corresponding variables in Stata?
Thank you for your help!
P.S. I know that stata uses infile or insheet and can handle , or tabs to separate variables. I did not know how to scrape a variable like Location in perl with all of the those so I added the ""

There are two ways to do this. The first is to paste the data into your do-file and use input. Assuming the format is fairly regular, you can clean it up easily using commas to parse. Note that I removed the commas:
#delimit;
input
str100(first_name last_name yob geo);
"Allen" "Von Schmidt" "1965" "District 1, Ocean City, Cape May, New Jersey, USA";
end;
compress;
destring, replace;
split geo, parse(,);
rename geo1 district;
rename geo2 city;
rename geo3 county;
rename geo4 state;
rename geo5 country;
drop geo;
The second way is to insheet the data from the txt file directly, which is probably easier. This assumes that the commas were not removed:
#delimit;
insheet first_name last_name yob geo using "raw_data.txt", clear comma nonames;
Then clean it up as in the first example.

This isn't a complete answer, but I need more space and flexibility than comments (easily) allow.
One trick is based on peeling off elements from the end. The easiest way to do that could be to start looking for the last comma, which is in turn the first comma in the reversed string. Use strpos(reverse(stringvar), ",").
For example the first commma is found by strpos() like this
. di strpos("abcd,efg,h", ",")
5
and the last comma like this
. di strpos(reverse("abcd,efg,h"), ",")
2
Once you know where the last comma is you can peel off the last element. If the last comma is at position # in the reversed string, it is at position -# in the string.
. di substr("abcd,efg,h", -2, 2)
,h
These examples clearly are calculator-style examples for single strings. But the last element can be stripped off similarly for entire string variables.
. gen poslastcomma = strpos(reverse(var), ",")
. gen var_end = substr(var, -poslastcomma, poslastcomma)
. gen var_begin = substr(var, 1, length(var) - poslastcomma)
Once you get used to stuff like this you can write more complicated statements with fewer variables, but slowly, slowly step by step is better when you are learning.
By the way, a common Stata learner error (in my view) is to assume that a solution to a string problem must entail the use of regular expressions. If you are very fluent at regular expressions, you can naturally do wonderful things with them, but the other string functions in conjunction can be very powerful too.
In your specific example, it sounds as if you want to ignore a last element such as "USA" and then work in turn on the next elements working backwards.
split in Stata is fine too (I am a fan and indeed am its putative author) but can be awkward if a split yields different numbers of elements, which is where I came in.

Vim: Parsing address fields from all around the globe

Intro
This post is long, but I consider it thorough. I hope this post might be helpful (addresses) to others while teaching complex VIM regexes. Thank you for your time.
Worldwide addresses:
American, Canadian and a few other countries are offered 5 fields on a form, which is then displayed in a comma delimited format that I need to further dissect. Ideally, the comma-separated content looks like:
Some Really Nice Place, 111 Street, Beautiful Town, StateOrProvince, zip
where zip can be either a series of just numbers (US) or numbers and letters (Canada).
Invariably, people throw an extra comma into their text box field input and that adds some complexity to the parsing of this data. For example:
Some Really Nice Place, 111 Street, suite 101, Beautiful Town, StateOrProvince, zip
Further complicating this parse is that the data from non-US and non-Canadian countries contains an extra comma-delimited field that was somehow provided to them - adding a place for them to enter their country. (No, there is no "US" or "Canada" field for their entries. So, it's "in addition" to the original 5 comma-delimited fields.) Such as:
Foreign Name of Building, A street name, A City, ,zip, Country
The ",," is usually empty as non-US countries do are not segmented into states. And, yes, the same "additional commas" as described above happens here too.
Foreign Name of Building, cross streets, district, A street name, A City, ,zip, Country
Parsing Strategy:
A country name will never include a digit, whereas a US or Canadian zip will always have at least some digits. If you go backwards using this assumption about the contents of the last field then you should be able to place the country, zip, State (if not empty ",,"), City and Street into their respect positions - which are the most important fields to get right. Anything beyond those sections could be lumped together in the first or or two lines as descriptions of the address (i.e. building, name, suite, cross streets, etc). For example:
Some Really Nice Place, 111 Street, suite 101, Beautiful Town, Lovely State, Digits&Letters
Last section has a digit (therefore a US or Canadian address)
There a total of 6 sections, so that's one more than the original 5
Knowing that sections 5-2 are zip, state, town, address...
6 minus 5 (original) = add an extra Address (Address2) field and leave the first section as the header, resulting in:
Header: Some Really Nice Place, Address1: 111 Street, Address2: Suite 101, Town: Beautiful Town, State/Province: Lovely State, Zip: Digits&Letters
Whereas there might be a discrepancy on where "111 Street" or "Suite 101" goes (Address1 or Address2), it at least gets the zip, state, city and address(s) lumped together and leaves the first section as the "Header" to the email address for data entry purposes.
Under this approach, foreign address get parsed like:
Foreign Name of Building, cross streets, district, A street name, A
City, ,zip, Country
Last section has no digit, so it must be a Country
That means, moving right to left, the second section is the zip
So now (foreign) you have an "original 6 sections" to subtract from the total of 7 in the example
7th section = country, 6th = zip, 5th = state (mostly blank on foreign address), 4th = City, 3rd = address1, 2nd = address2, 1st = header
We knew to use two address fields because the example had 7 sections and foreign addresses have a base of 6 sections. Any number of sections above the base are added to a second address2 field. If there are 3 sections above the base section count then they are appended to each inside the address2 field.
Coding
In this approach using VIM, how would I initially read the number of comma-delimited sections (after I've captured the entire address in a register)? How do I do submatch(es) on a series of comma-delimited sections for which I am not sure the number of sections that exist?
Example Addresses
Here are some practice address (US and Foreign) if you are so inclined to help:
City Gas & Electric - Bldg 4, 222 Middle Park Ct, CP4120F, Dallas, Texas, 44984
MHG Engineering, Inc. Suite 200, 9899 Balboa Ave, San Diego, California, 92123-1502
SolarWind Turbines, 2nd Floor Conference Room, 2300 Ruffin Road, Seattle, Washington, 84444
123 Aeronautics, 2239 Industry Parkway, Salt Lake City, Utah, 55344
Ongwanda Gov't Resources, 6000 Portsmouth Avenue, Ottawa, Ontario, K7M 8A6
Graylang Seray Center, 6600 Haig Rd, Singapore, , 437848, Singapore
Lot 459, Block 14, Jalan Sultan Tengah, Petra Jaya, Kuching, , 93050, Malaysia
Virtual Steel, 1 Umgazi Rd Aspec Park, Pretoria, , 0075, South Africa
Idiom Towers South, Fifth Floor, Jasmen Conference Room, 1500 Freedom Street, Pretoria, , 0002, South Africa

The following code is a draft-quality Vim script (hopefully) implementing the
address parsing routine described in the question.
function! ParseAddress(line)
let r = split(a:line, ',\s*', 1)
let hadcountry = r[-1] !~ '\d'
let a = {}
let a.country = hadcountry ? r[-1] : ''
let r = r[:-1-hadcountry]
let a.zip = r[-1]
let a.state = r[-2]
let a.city = r[-3]
let a.header = r[0]
let nleft = len(r) - 4
if hadcountry
let a.address1 = r[-4]
let a.address2 = join(r[1:nleft-1], ', ')
else
let a.address1 = r[1]
let a.address2 = join(r[2:nleft], ', ')
endif
return a
endfunction
function! FormatAddress(a)
let t = map([
\ ['Header', 'header'],
\ ['Address 1', 'address1'],
\ ['Address 2', 'address2'],
\ ['Town', 'city'],
\ ['State/Province', 'state'],
\ ['Country', 'country'],
\ ['Zip', 'zip']],
\ 'has_key(a:a, v:val[1]) && !empty(a:a[v:val[1]])' .
\ '? v:val[0] . ": " . a:a[v:val[1]] : ""')
return join(filter(t, '!empty(v:val)'), '; ')
endfunction
The command below can be used to test the above parsing routines.
:g/\w/call setline(line('.'), FormatAddress(ParseAddress(getline('.'))))
(One can provide a range to the :global command to run it through fewer
number of test address lines.)

Maybe you should review some of the other questions about addresses around the world. The USA and Canada are extraordinarily systematic with their systems; most other countries are a lot less rigorous about the approved formats. Anything you devise for the USA and Canada will run into issues almost immediately you deal with other addresses.
Best practices for storing postal addresses in a database
Is there a common street address database design for all addresses of the world
How many address fields would you use for a UK address
ISO Standard Street Addresses
There are probably other related questions: see the tag street-address for some of them.

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js