Regex extraction of text data between 2 commas in R

Regex extraction of text data between 2 commas in R - regex

I have a bunch of text in a dataframe (df) that usually contains three lines of an address in 1 column and my goal is to extract the district (central part of the text), eg:
73 Greenhill Gardens, Wandsworth, London
22 Acacia Heights, Lambeth, London
Fortunately for me in 95% of cases the person inputing the data has used commas to separate the text I want, which 100% of the time ends ", London" (ie comma space London). To state things clearly therefore my goal is to extract the text BEFORE ", London" and AFTER the preceding comma
My desired output is:
Wandsworth
Lambeth
I can manage to extract the part before:
df$extraction <- sub('.*,\\s*','',address)
and after
df$extraction <- sub('.*,\\s*','',address)
But not the middle part that I need. Can someone please help?
Many Thanks!

You could save yourself the headache of a regular expression and treat the vector like a CSV, using a file reading function to extract the relevant part. We can use read.csv(), taking advantage of the fact that colClasses can be used to drop columns.
address <- c(
"73 Greenhill Gardens, Wandsworth, London",
"22 Acacia Heights, Lambeth, London"
)
read.csv(text = address, colClasses = c("NULL", "character", "NULL"),
header = FALSE, strip.white = TRUE)[[1L]]
# [1] "Wandsworth" "Lambeth"
Or we could use fread(). Its select argument is nice and it strips white space automatically.
data.table::fread(paste(address, collapse = "\n"),
select = 2, header = FALSE)[[1L]]
# [1] "Wandsworth" "Lambeth"

Here are a couple of approaches:
# target ", London" and the start of the string
# up until the first comma followed by a space,
# and replace with ""
gsub("^.+?, |, London", "", address)
#[1] "Wandsworth" "Lambeth"
Or
# target the whole string, but use a capture group
# for the text before ", London" and after the first comma.
# replace the string with the captured group.
sub(".+, (.*), London", "\\1", address)
#[1] "Wandsworth" "Lambeth"

Here are two options that aren't dependent on the city name being the same. The first uses a regex pattern with stringr::str_extract():
raw_address <- c(
"73 Greenhill Gardens, Wandsworth, London",
"22 Acacia Heights, Lambeth, London",
"Street, District, City"
)
df <- data.frame(raw_address, stringsAsFactors = FALSE)
df$distict = stringr::str_extract(raw_address, '(?<=,)[^,]+(?=,)')
> df
raw_address distict
1 73 Greenhill Gardens, Wandsworth, London Wandsworth
2 22 Acacia Heights, Lambeth, London Lambeth
3 Street, District, City District
The second uses strsplit() and makes getting the other elements of the address easier:
df$address <- sapply(strsplit(raw_address, ',\\s*'), `[`, 1)
df$distict <- sapply(strsplit(raw_address, ',\\s*'), `[`, 2)
df$city <- sapply(strsplit(raw_address, ',\\s*'), `[`, 3)
> df
raw_address address distict city
1 73 Greenhill Gardens, Wandsworth, London 73 Greenhill Gardens Wandsworth London
2 22 Acacia Heights, Lambeth, London 22 Acacia Heights Lambeth London
3 Street, District, City Street District City
The split is done on ,\\s* in case there is no space or are multiple spaces after a comma.

You could try this
(?<=, )(.+?),
Works with any data set location doesn't have to be in london.

Related

Replace Value & Shift Data Frame If Certain Condition Met

I've scraped data from a source online to create a data frame (df1) with n rows of information pertaining to individuals. It comes in as a single string, and I split the words apart into appropriate columns.
90% of the information is correctly formatted to the proper number of columns in a data frame (6) - however, once in a while there is a row of data with an extra word that is located in the spot of the 4th word from the start of the string. Those lines now have 7 columns and are off-set from everything else in the data frame.
Here is an example:
Num Last-Name First-Name Cat. DOB Location
11 Jackson, Adam L 1982-06-15 USA
2 Pearl, Sam R 1986-11-04 UK
5 Livingston, Steph LL 1983-12-12 USA
7 Thornton, Mark LR 1982-03-26 USA
10 Silver, John RED LL 1983-09-14 USA
df1 = c(" 11 Jackson, Adam L 1982-06-15 USA",
"2 Pearl, Sam R 1986-11-04 UK",
"5 Livingston, Steph LL 1983-12-12 USA",
"7 Thornton, Mark LR 1982-03-26 USA",
"10 Silver, John RED LL 1983-09-14 USA")
You can see item #10 has an extra input added, the color "RED" is inserted into the middle of the string.
I started to run code that used stringr to evaluate how many characters were present in the 4th word, and if it was 3 or greater (every value that will be in the Cat. column is is 1-2 characters), I created a new column at the end of the data frame, assigned the value to it, and if there was no value (i.e. it evaluates to FALSE), input NA. I'm sure I could likely create a massive nested ifelse statement in a dplyr mutate (my personal comfort zone), but I figure there must be a more efficient way to achieve my desired result:
Num Last-Name First-Name Cat. DOB Location Color
11 Jackson, Adam L 1982-06-15 USA NA
2 Pearl, Sam R 1986-11-04 UK NA
5 Livingston, Steph LL 1983-12-12 USA NA
7 Thornton, Mark LR 1982-03-26 USA NA
10 Silver, John LL 1983-09-14 USA RED
I want to find the instances where the 4th word from the start of the string is 3 characters or longer, assign that word or value to a new column at the end of the data frame, and shift the corresponding values in the row to the left to properly align with the others rows of data.

here's a simpler way:
input <- gsub("(.*, \\w+) ((?:\\w){3,})(.*)", "\\1 \\3 \\2", input, TRUE)
input <- gsub("([0-9]\\s\\w+)\\n", "\\1 NA\n", input, TRUE)
the first gsub transposes colors to the end of the string. the second gsub makes use of the fact that unchanged lines will now end with a date and country-code (not a country-code and a color), and simply adds an "NA" to them.
IDEone demo

We could use gsub to remove the extra substrings
v1 <- gsub("([^,]+),(\\s+[[:alpha:]]+)\\s*\\S*(\\s+[[:alpha:]]+\\s+\\d{4}-\\d{2}-\\d{2}.*)",
"\\1\\2\\3", trimws(df1))
d1 <- read.table(text=v1, sep="", header=FALSE, stringsAsFactors=FALSE,
col.names = c("Num", "LastName", "FirstName", "Cat", "DOB", "Location"))
d1$Color <- trimws(gsub("^[^,]+,\\s+[[:alpha:]]+|[[:alpha:]]+\\s+\\d{4}-\\d{2}-\\d{2}\\s+\\S+$",
"", trimws(df1)))
d1
# Num LastName FirstName Cat DOB Location Color
#1 11 Jackson Adam L 1982-06-15 USA
#2 2 Pearl Sam R 1986-11-04 UK
#3 5 Livingston Steph LL 1983-12-12 USA
#4 7 Thornton Mark LR 1982-03-26 USA
#5 10 Silver John LL 1983-09-14 USA RED

Using strsplit instead of regex:
# split strings in df1 on commas and spaces not preceded by the start of the line
s <- strsplit(df1, '(?<!^)[, ]+', perl = T)
# iterate over s, transpose the result and make it a data.frame
df2 <- data.frame(t(sapply(s, function(x){
# if number of items in row is 6, insert NA, else rearrange
if (length(x) == 6) {c(x, NA)} else {x[c(1:3, 5:7, 4)]}
})))
# add names
names(df2) <- c("Num", "Last-Name", "First-Name", "Cat.", "DOB", "Location", "Color")
df2
# Num Last-Name First-Name Cat. DOB Location Color
# 1 11 Jackson Adam L 1982-06-15 USA <NA>
# 2 2 Pearl Sam R 1986-11-04 UK <NA>
# 3 5 Livingston Steph LL 1983-12-12 USA <NA>
# 4 7 Thornton Mark LR 1982-03-26 USA <NA>
# 5 10 Silver John LL 1983-09-14 USA RED

Break string into several columns using tidyr::extract regex

I'm trying to break a string vector into several variables using regular expressions in R, preferably in a dplyr-tidyr way using the tidyr::extract command. For insctance in the vector bellow:
sasdic <- data.frame(a=c(
'#1 ANO_CENSO 5. /*Ano do Censo*/',
'#71 TP_SEXO $Char1. /*Sexo*/',
'#72 TP_COR_RACA $Char1. /*Cor/raça*/',
'#74 FK_COD_PAIS_ORIGEM 4. /*Código País de origem*/' ))
I would like for the:
first number ([0-9]+) to go to variable "int_pos"
the variable name connected by undersline ([a-zA-Z_]+) to go to variable "var_name"
The second number or the term $Char1 (could be $Char2, etc) to go to var "x". I figured ([0-9]+|$Char[0-9]+) could select this?
Lastly, whatever comes in between "/* ... /" to go to variable "label" (don´t know the regex for this).
All other intermidiate caracters (blank spaces, ".", "/", "" should be disconsidered)
This would be the result
d <- data.frame(int_pos=c(1,72,72,74),
var_name=c('ANO_CENSO','TP_SEXO','TP_COR_RACA','FK_COD_PAIS_ORIGEM'),
x=c('5','Chart1','$Char1','4'),
label=c('Ano do Censo','Sexo','Cor/raça','Código País de origem') )
I tryed to construct a regular expression for this. This is what I got so far:
sasdic %>% extract(a, c('int_pos','var_name','x','label'),
"([0-9]+)([a-zA-Z_]+)([0-9]+|$Char[0-9]+)(something to get the label")
-> d
above the regular expression is incomplete. Also, I don't know hot to make explicit in the extract command syntax, what are the parts to be recovered and what are the parts to leave out.

In the regex used, we are matchng one more more punctuation characters ([[:punct:]]+) i.e. # followed by capturing the numeric part ((\\d+) - this will be our first column of interest), followed by one or more white-space (\\s+), followed by the second capture group (\\S+ - one or more non white-space character i.e. "ANO_CENSO" for the first row), followed by space (\\s+), then we capture the third group (([[:alum:]$]+) - i.e. one or more characters that include the alpha numeric along with $ so as to match $Char1), next we match one or more characters that are not a letter ([^A-Za-z]+- this should get rid of the space and *) and the last part we capture one or more characters that are not * (([^*]+).
sasdic %>%
extract(a, into=c('int_pos', 'var_name', 'x', 'label'),
"[[:punct:]](\\d+)\\s+(\\S+)\\s+([[:alnum:]$]+)[^A-Za-z]+([^*]+)")
# int_pos var_name x label
#1 1 ANO_CENSO 5 Ano do Censo
#2 71 TP_SEXO $Char1 Sexo
#3 72 TP_COR_RACA $Char1 Cor/raça
#4 74 FK_COD_PAIS_ORIGEM 4 Código País de origem

This is another option, though it uses the data.table package instead of tidyr:
library(data.table)
setDT(sasdic)
# split label
sasdic[, c("V1","label") := tstrsplit(a, "/\\*|\\*/")]
# remove leading "#", split remaining parts
sasdic[, c("int_pos","var_name","x") := tstrsplit(gsub("^#","",V1)," +")]
# remove unneeded columns
sasdic[, c("a","V1") := NULL]
sasdic
# label int_pos var_name x
# 1: Ano do Censo 1 ANO_CENSO 5.
# 2: Sexo 71 TP_SEXO $Char1.
# 3: Cor/raça 72 TP_COR_RACA $Char1.
# 4: Código País de origem 74 FK_COD_PAIS_ORIGEM 4.
This assumes that the "remaining parts" (aside from the label) are space-separated.
This could also be done in one block (which is what I would do):
sasdic[, c("a","label","int_pos","var_name","x") := {
x = tstrsplit(a, "/\\*|\\*/")
x1s = tstrsplit(gsub("^#","",x[[1]])," +")
c(list(NULL), x1s, x[2])
}]

You could use the package unglue :
library(unglue)
unglue_unnest(sasdic, a, "#{int_pos}{=\\s+}{varname}{=\\s+}{x}.{=\\s+}/*{label}*/")
#> int_pos varname x label
#> 1 1 ANO_CENSO 5 Ano do Censo
#> 2 71 TP_SEXO $Char1 Sexo
#> 3 72 TP_COR_RACA $Char1 Cor/ra<e7>a
#> 4 74 FK_COD_PAIS_ORIGEM 4 C<f3>digo Pa<ed>s de origem

R get first letters of double/tripple-barrel surnames in data.frame

I have a dataframe with 2 columns:
> df1
Surname Name
1 The Builder Bob
2 Zeta-Jones Catherine
I want to add a third column "Shortened_Surname" which contains the first letters of all the words in the surname field:
Surname Name Shortened_Surname
1 The Builder Bob TB
2 Zeta-Jones Catherine ZJ
Note the "-" in the second name. I have barreled surnames separated by spaces and hyphens.
I have tried:
Step1:
> strsplit(unlist(as.character(df1$Surname))," ")
[[1]]
[1] "The" "Builder"
[[2]]
[1] "Zeta-Jones"
My research suggests I could possibly use strtrim as a Step 2, but all I have found is a number of ways how not to do it.

You can target the space, hyphen, and beginning of the line with lookarounds. For instance, you any character (.) not preceded by the beginning of the line, a space, or a hyphen should be substituted to "":
with(df, gsub("(?<!^|[ -]).", "", Surname, perl=TRUE))
[1] "TB" "ZJ"
or
with(df, gsub("(?<=[^ -]).", "", Surname, perl=TRUE))
The second gsub substitutes a blank ("") for any character that is preceded by a character that is not a " " or "-".

You can try this, if the format of the names is as show in the input data:
library(stringr)
df$Shortened_Surname <- sapply(str_extract_all(df$Surname, '[A-Z]{1}'), function(x) paste(x, collapse = ''))
Output is as follows:
Surname Name Shortened_Surname
1 The Builder Bob TB
2 Zeta-Jones Catherine ZJ
If the format of the names is somewhat inconsistent, you will need to modify the above pattern to capture that. You can use |, & operators inside the pattern to combine multiple patterns.

Substring extraction from vector in R

I am trying to extract substrings from a unstructured text. For example, assume a vector of country names:
countries <- c("United States", "Israel", "Canada")
How do I go about passing this vector of character values to extract exact matches from unstructured text.
text.df <- data.frame(ID = c(1:5),
text = c("United States is a match", "Not a match", "Not a match",
"Israel is a match", "Canada is a match"))
In this example, the desired output would be:
ID text
1 United States
4 Israel
5 Canada
So far I have been working with gsub by where I remove all non-matches and then eliminate then remove rows with empty values. I have also been working with str_extract from the stringr package, but haven't had success getting the arugments for the regular expression correct. Any assistance would be greatly appreciated!

1. stringr
We could first subset the 'text.df' using the 'indx' (formed from collapsing the 'countries' vector) as pattern in 'grep' and then use 'str_extract' the get the pattern elements from the 'text' column, assign that to 'text' column of the subset dataset ('text.df1')
library(stringr)
indx <- paste(countries, collapse="|")
text.df1 <- text.df[grep(indx, text.df$text),]
text.df1$text <- str_extract(text.df1$text, indx)
text.df1
# ID text
#1 1 United States
#4 4 Israel
#5 5 Canada
2. base R
Without using any external packages, we can remove the characters other than those found in 'ind'
text.df1$text <- unlist(regmatches(text.df1$text,
gregexpr(indx, text.df1$text)))
3. stringi
We could also use the faster stri_extract from stringi
library(stringi)
na.omit(within(text.df, text1<- stri_extract(text, regex=indx)))[-2]
# ID text1
#1 1 United States
#4 4 Israel
#5 5 Canada

Here's an approach with data.table:
library(data.table)
##
R> data.table(text.df)[
sapply(countries, function(x) grep(x,text),USE.NAMES=F),
list(ID, text = countries)]
ID text
1: 1 United States
2: 4 Israel
3: 5 Canada

Create the pattern, p, and use strapply to extract the match to each component of text returning NA for each unmatched component. Finally remove the NA values using na.omit. This is non-destructive (i.e. text.df is not modified):
library(gsubfn)
p <- paste(countries, collapse = "|")
na.omit(transform(text.df, text = strapply(paste(text), p, empty = NA, simplify = TRUE)))
giving:
ID text
1 1 United States
4 4 Israel
5 5 Canada
Using dplyr it could also be written as follows (using p from above):
library(dplyr)
library(gsubfn)
text.df %>%
mutate(text = strapply(paste(text), p, empty = NA, simplify = TRUE)) %>%
na.omit

How to replace specific characters of a string with tab in R

Having a data frame with a string in each row, I need to replace n'th character into tab. Moreover, there are an inconstant number of spaces before m'th character that I need to convert to tab as well.
For instance having following row:
"00001 000 0 John Smith"
I need to replace the 6th character (space) into tab and replace the spaces between John and Smith into tab as well. For all the rows the last word (Smith) starts from 75th character. So, basically I need to replace all spaces before 78th character into tab.
I need the above row as follows:
"00001<Tab>000 0 John<Tab>Smith"
Thanks for the help.

You could use gsub here.
x <- c('00001 000 0 John Smith',
'00002 000 1 Josh Black',
'00003 000 2 Jane Smith',
'00004 000 3 Jeff Smith')
x <- gsub("(?<=[0-9]{5}) |(?<!\\d) +(?=(?i:[a-z]))", "\t", x, perl=T)
Output
[1] "00001\t000 0 John\tSmith" "00002\t000 1 Josh\tBlack"
[3] "00003\t000 2 Jane\tSmith" "00004\t000 3 Jeff\tSmith"
To actually see the \t in output use cat(x)
00001 000 0 John Smith
00002 000 1 Josh Black
00003 000 2 Jane Smith
00004 000 3 Jeff Smith

Here's one solution if it always starts at 75. First some sample data
#sample data
a <- "00001 000 0 John Smith"
b <- "00001 000 0 John Smith"
Now since you know positions, i'll use substr. To extract the parts, then i'll trim the middle, then you can paste in the tabs.
#extract parts
part1<-substr(c(a,b), 1, 5)
part2<-gsub("\\s*$","",substr(c(a,b), 7, 74))
part3<-substr(c(a,b), 75, 10000L)
#add in tabs
paste(part1, part2, part3, sep="\t")

We Keep Coding

c++ django amazon-web-services regex python-2.7 google-cloud-platform list unit-testing opengl ember.js