I am trying to extract substrings from unstructured text. For example, assume a vector of country names:
countries <- c("United States", "Israel", "Canada")
How do I pass this vector of character values as a pattern to extract exact matches from unstructured text?
text.df <- data.frame(ID = c(1:5),
text = c("United States is a match", "Not a match", "Not a match",
"Israel is a match", "Canada is a match"))
In this example, the desired output would be:
ID text
1 United States
4 Israel
5 Canada
So far I have been working with gsub, removing all non-matches and then dropping the rows with empty values. I have also tried str_extract from the stringr package, but haven't managed to get the arguments for the regular expression right. Any assistance would be greatly appreciated!
1. stringr
We could first subset 'text.df' using 'indx' (formed by collapsing the 'countries' vector with "|") as the pattern in 'grep', then use 'str_extract' to get the matching elements from the 'text' column and assign them to the 'text' column of the subset dataset ('text.df1').
library(stringr)
indx <- paste(countries, collapse="|")
text.df1 <- text.df[grep(indx, text.df$text),]
text.df1$text <- str_extract(text.df1$text, indx)
text.df1
# ID text
#1 1 United States
#4 4 Israel
#5 5 Canada
2. base R
Without using any external packages, we can extract the substrings that match 'indx' with 'regmatches' and 'gregexpr'
text.df1 <- text.df[grep(indx, text.df$text),]
text.df1$text <- unlist(regmatches(text.df1$text,
gregexpr(indx, text.df1$text)))
3. stringi
We could also use the faster stri_extract from stringi
library(stringi)
na.omit(within(text.df, text1<- stri_extract(text, regex=indx)))[-2]
# ID text1
#1 1 United States
#4 4 Israel
#5 5 Canada
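The same alternation pattern can also be applied in a single pass with base R, without subsetting first. The following is only a minimal sketch: the word-boundary anchors (\\b) and the name 'result' are assumptions added for illustration, on the reading that "exact matches" means whole-word matches.

```r
countries <- c("United States", "Israel", "Canada")
text.df <- data.frame(ID = 1:5,
                      text = c("United States is a match", "Not a match", "Not a match",
                               "Israel is a match", "Canada is a match"),
                      stringsAsFactors = FALSE)

# Assumption: wrap the alternation in \\b so only whole words match
pattern <- paste0("\\b(", paste(countries, collapse = "|"), ")\\b")

# regexpr() returns -1 for rows without a match
m <- regexpr(pattern, text.df$text, perl = TRUE)
hit <- m != -1

# regmatches() returns one string per matching row, in row order
result <- data.frame(ID = text.df$ID[hit],
                     text = regmatches(text.df$text, m),
                     stringsAsFactors = FALSE)
result
```

If any of the search terms could contain regex metacharacters (a dot or parentheses, say), they would need escaping before being pasted into the pattern.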
Here's an approach with data.table:
library(data.table)
data.table(text.df)[
sapply(countries, function(x) grep(x,text),USE.NAMES=FALSE),
list(ID, text = countries)]
ID text
1: 1 United States
2: 4 Israel
3: 5 Canada
Create the pattern, p, and use strapply to extract the match to each component of text returning NA for each unmatched component. Finally remove the NA values using na.omit. This is non-destructive (i.e. text.df is not modified):
library(gsubfn)
p <- paste(countries, collapse = "|")
na.omit(transform(text.df, text = strapply(paste(text), p, empty = NA, simplify = TRUE)))
giving:
ID text
1 1 United States
4 4 Israel
5 5 Canada
Using dplyr it could also be written as follows (using p from above):
library(dplyr)
library(gsubfn)
text.df %>%
mutate(text = strapply(paste(text), p, empty = NA, simplify = TRUE)) %>%
na.omit
I'd like to split a vector of character strings (people's names) into two columns (vectors). The problem is that some people have a 'two word' last name. I'd like to split the first and last names into two columns. I can split out the first names using the code below, but the last name eludes me. (Look at obs 29 in the sample set below to get an idea: the Ford has a "last name" of Pantera L that must be kept together.)
What I have attempted so far:
x<-rownames(mtcars)
unlist(strsplit(x, " .*"))
What I'd like it to look like:
MANUF MAKE
27 Porsche 914-2
28 Lotus Europa
29 Ford Pantera L
30 Ferrari Dino
31 Maserati Bora
32 Volvo 142E
The regular expression rexp matches the word at the start of the string, an optional space, then the rest of the string. The parentheses delimit subexpressions, which are accessed as backreferences \\1 and \\2.
rexp <- "^(\\w+)\\s?(.*)$"
y <- data.frame(MANUF=sub(rexp,"\\1",x), MAKE=sub(rexp,"\\2",x))
tail(y)
# MANUF MAKE
# 27 Porsche 914-2
# 28 Lotus Europa
# 29 Ford Pantera L
# 30 Ferrari Dino
# 31 Maserati Bora
# 32 Volvo 142E
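Both capture groups can also be pulled out in one pass instead of calling sub twice. This is a hedged sketch using regexec and regmatches from base R; the short vector x and the name 'mat' are made up for illustration and stand in for rownames(mtcars).

```r
# Hypothetical sample standing in for rownames(mtcars)
x <- c("Ford Pantera L", "Valiant", "Porsche 914-2")
rexp <- "^(\\w+)\\s?(.*)$"

# Each list element holds: full match, group 1, group 2
parts <- regmatches(x, regexec(rexp, x))

# Keep the two groups and stack them into a 2-column character matrix
mat <- t(sapply(parts, function(p) p[2:3]))
colnames(mat) <- c("MANUF", "MAKE")
mat
```

A name without a space, such as "Valiant", ends up with an empty string in the MAKE column because the second group matches nothing.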
For me, Hadley's colsplit function in the reshape2 package is the most intuitive for this purpose. Joshua's way is more general (ie can be used wherever a regex could be used) and flexible (if you want to change the specification); but the colsplit function is perfectly suited to this specific setting:
library(reshape2)
y <- colsplit(x," ",c("MANUF","MAKE"))
tail(y)
# MANUF MAKE
#27 Porsche 914-2
#28 Lotus Europa
#29 Ford Pantera L
#30 Ferrari Dino
#31 Maserati Bora
#32 Volvo 142E
Here are two approaches:
1) strsplit. This approach uses only functions in the core of R and no complex regular expressions. Replace the first space with a semicolon (using sub and not gsub), strsplit on the semicolon and then rbind it into a 2 column matrix:
mat <- do.call("rbind", strsplit(sub(" ", ";", x), ";"))
colnames(mat) <- c("MANUF", "MAKE")
2) strapply in gsubfn package. Here is a one-liner using strapply in the gsubfn package. The two parenthesized portions of the regular expression capture the desired first and second columns respectively, and the function (which is specified in formula notation; it's the same as specifying function(x, y) c(MANUF = x, MAKE = y)) grabs them and adds names. The simplify=rbind argument is used to turn the result into a matrix as in the prior solution.
library(gsubfn)
mat <- strapply(x, "(\\S+)\\s+(.*)", ~ c(MANUF = x, MAKE = y), simplify = rbind)
Note: In either case a "character" matrix, mat, is returned. If a data frame of "character" columns is desired then add this:
DF <- as.data.frame(mat, stringsAsFactors = FALSE)
Omit the stringsAsFactors argument if "factor" columns are wanted.
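One thing to watch with the split-on-first-space idea is an entry that contains no space at all ("Valiant" in mtcars): strsplit then returns a single piece, and rbind recycles it into both columns. The sketch below is one possible way to handle that case explicitly with regexpr and substr; the sample vector and the names 'pos' and 'has_space' are assumptions for illustration.

```r
# Hypothetical sample standing in for rownames(mtcars)
x <- c("Ford Pantera L", "Valiant", "Lotus Europa")

# Position of the first space, or -1 when there is none
pos <- as.integer(regexpr(" ", x, fixed = TRUE))
has_space <- pos > 0

# Everything before the first space, or the whole string if there is no space
MANUF <- ifelse(has_space, substr(x, 1, pos - 1), x)
# Everything after the first space, or "" if there is no space
MAKE <- ifelse(has_space, substring(x, pos + 1), "")

DF <- data.frame(MANUF, MAKE, stringsAsFactors = FALSE)
DF
```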
Yet another way of doing it:
str_split from stringr will handle the split, but returns it in a different form (a list, like strsplit does). Manipulating into the correct form is straightforward though.
library(stringr)
split_x <- str_split(x, " ", 2)
(y <- data.frame(
MANUF = sapply(split_x, head, n = 1),
MAKE = sapply(split_x, tail, n = 1)
))
Or, as Hadley mentioned in the comments, with str_split_fixed.
y <- as.data.frame(str_split_fixed(x, " ", 2))
colnames(y) <- c("MANUF", "MAKE")
y
If you can do pattern and group matching, I'd try something like this (untested):
\s+(.*)\s+(.*)
You can also use tidyr::extract after first converting your vector into a data frame. I think this is the more modern counterpart of the older reshape2 solutions.
library(tidyr)
## first convert into a data frame
x <- data.frame(x = rownames(mtcars))
## use extract, and for example Joshua's regex
res <- extract(x, col = x, into = c("MANUF", "MAKE"), regex = "^(\\w+)\\s?(.*)$")
head(res)
#> MANUF MAKE
#> 1 Mazda RX4
#> 2 Mazda RX4 Wag
#> 3 Datsun 710
#> 4 Hornet 4 Drive
#> 5 Hornet Sportabout
#> 6 Valiant
I think searching for [^\s]+ would work. Untested.
I've scraped data from a source online to create a data frame (df1) with n rows of information pertaining to individuals. It comes in as a single string, and I split the words apart into appropriate columns.
90% of the information is correctly formatted to the proper number of columns in a data frame (6) - however, once in a while there is a row of data with an extra word that is located in the spot of the 4th word from the start of the string. Those lines now have 7 columns and are off-set from everything else in the data frame.
Here is an example:
Num Last-Name First-Name Cat. DOB Location
11 Jackson, Adam L 1982-06-15 USA
2 Pearl, Sam R 1986-11-04 UK
5 Livingston, Steph LL 1983-12-12 USA
7 Thornton, Mark LR 1982-03-26 USA
10 Silver, John RED LL 1983-09-14 USA
df1 = c(" 11 Jackson, Adam L 1982-06-15 USA",
"2 Pearl, Sam R 1986-11-04 UK",
"5 Livingston, Steph LL 1983-12-12 USA",
"7 Thornton, Mark LR 1982-03-26 USA",
"10 Silver, John RED LL 1983-09-14 USA")
You can see item #10 has an extra input added, the color "RED" is inserted into the middle of the string.
I started to write code that used stringr to evaluate how many characters were present in the 4th word, and if it was 3 or greater (every value that will be in the Cat. column is 1-2 characters), I created a new column at the end of the data frame and assigned the value to it; if there was no value (i.e. it evaluates to FALSE), I input NA. I'm sure I could create a massive nested ifelse statement in a dplyr mutate (my personal comfort zone), but I figure there must be a more efficient way to achieve my desired result:
Num Last-Name First-Name Cat. DOB Location Color
11 Jackson, Adam L 1982-06-15 USA NA
2 Pearl, Sam R 1986-11-04 UK NA
5 Livingston, Steph LL 1983-12-12 USA NA
7 Thornton, Mark LR 1982-03-26 USA NA
10 Silver, John LL 1983-09-14 USA RED
I want to find the instances where the 4th word from the start of the string is 3 characters or longer, assign that word to a new column at the end of the data frame, and shift the remaining values in the row to the left so that they align with the other rows of data.
Here's a simpler way:
input <- gsub("(.*, \\w+) ((?:\\w){3,})(.*)", "\\1 \\3 \\2", input, TRUE)
input <- gsub("([0-9]\\s\\w+)\\n", "\\1 NA\n", input, TRUE)
The first gsub moves the color to the end of the string. The second gsub makes use of the fact that unchanged lines will now end with a date and country code (not a country code and a color), and simply appends "NA" to them.
We could use gsub to remove the extra substrings
v1 <- gsub("([^,]+),(\\s+[[:alpha:]]+)\\s*\\S*(\\s+[[:alpha:]]+\\s+\\d{4}-\\d{2}-\\d{2}.*)",
"\\1\\2\\3", trimws(df1))
d1 <- read.table(text=v1, sep="", header=FALSE, stringsAsFactors=FALSE,
col.names = c("Num", "LastName", "FirstName", "Cat", "DOB", "Location"))
d1$Color <- trimws(gsub("^[^,]+,\\s+[[:alpha:]]+|[[:alpha:]]+\\s+\\d{4}-\\d{2}-\\d{2}\\s+\\S+$",
"", trimws(df1)))
d1
# Num LastName FirstName Cat DOB Location Color
#1 11 Jackson Adam L 1982-06-15 USA
#2 2 Pearl Sam R 1986-11-04 UK
#3 5 Livingston Steph LL 1983-12-12 USA
#4 7 Thornton Mark LR 1982-03-26 USA
#5 10 Silver John LL 1983-09-14 USA RED
Using strsplit instead of regex:
# split strings in df1 on commas and spaces not preceded by the start of the line
s <- strsplit(df1, '(?<!^)[, ]+', perl = TRUE)
# iterate over s, transpose the result and make it a data.frame
df2 <- data.frame(t(sapply(s, function(x){
# if number of items in row is 6, insert NA, else rearrange
if (length(x) == 6) {c(x, NA)} else {x[c(1:3, 5:7, 4)]}
})))
# add names
names(df2) <- c("Num", "Last-Name", "First-Name", "Cat.", "DOB", "Location", "Color")
df2
# Num Last-Name First-Name Cat. DOB Location Color
# 1 11 Jackson Adam L 1982-06-15 USA <NA>
# 2 2 Pearl Sam R 1986-11-04 UK <NA>
# 3 5 Livingston Steph LL 1983-12-12 USA <NA>
# 4 7 Thornton Mark LR 1982-03-26 USA <NA>
# 5 10 Silver John LL 1983-09-14 USA RED
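The rule stated in the question (treat the 4th word as a color when it is 3 or more characters long) can also be applied directly, rather than counting fields. This is a minimal sketch under that assumption; the names 'fields' and 'rows' are illustrative only, and the column names are simplified to avoid hyphens and dots.

```r
df1 <- c(" 11 Jackson, Adam L 1982-06-15 USA",
         "2 Pearl, Sam R 1986-11-04 UK",
         "5 Livingston, Steph LL 1983-12-12 USA",
         "7 Thornton, Mark LR 1982-03-26 USA",
         "10 Silver, John RED LL 1983-09-14 USA")

# Trim first so the leading space does not produce an empty first field
fields <- strsplit(trimws(df1), "[, ]+")

rows <- lapply(fields, function(f) {
  # Assumption from the question: Cat. values are always 1-2 characters,
  # so a 4th word of 3+ characters must be the extra color
  if (nchar(f[4]) >= 3) c(f[-4], f[4]) else c(f, NA_character_)
})

df2 <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
names(df2) <- c("Num", "LastName", "FirstName", "Cat", "DOB", "Location", "Color")
df2
```

All columns come back as character, so Num and DOB would still need converting (for example with as.integer and as.Date) if typed columns are wanted.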
I'm trying to use Flodel's answer here (extra commas in csv causing problems) in order to import some messy CSV data, but I'm having trouble implementing the solution.
When I have more columns than three, I don't know how to get the text and extra comma into my desired column. I'm pretty sure the problem is in my pattern; I just don't know how to fix it.
file <- textConnection("123, hi, NAME1, EMAIL1#ADDRESS.COM
111, hi, NAME2, EMAIL2#ADRESS.ME
699, hi, FIRST M. LAST, Jr., EMAIL4#ADDRESS.GOV")
lines <- readLines(file)
pattern <- "^(\\d+), (.*), (.*), \\b(.*)$"
matches <- regexec(pattern, lines)
bad.rows <- which(sapply(matches, length) == 1L)
if (length(bad.rows) > 0L) stop(paste("bad row: ", lines[bad.rows]))
data <- regmatches(lines, matches)
as.data.frame(matrix(unlist(data), ncol = 5L, byrow = TRUE)[, -1L])
which gives me
V1 V2 V3 V4
123 hi NAME1 EMAIL1#ADDRESS.COM
111 hi NAME2 EMAIL2#ADRESS.ME
699 hi, FIRST M. LAST Jr. EMAIL4#ADDRESS.GOV
I'd like to see:
V1 V2 V3 V4
123 hi NAME1 EMAIL1#ADDRESS.COM
111 hi NAME2 EMAIL2#ADRESS.ME
699 hi FIRST M. LAST, Jr. EMAIL4#ADDRESS.GOV
If you're more explicit with what you want to match on, you might get better results. If column two will always only have a single string that does not include a comma, you can use:
pattern <- "^(\\d+), ([^,]+), (.*), \\b(.*)$"
In my experience, the best approach is to make your regular expression as explicit as you can first, and generalize only when that stops working. For example, if the second string is always hi, include that in your regex:
pattern <- "^(\\d+), (hi), (.*), \\b(.*)$"
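To see the effect of the stricter second group, the first pattern above can be run against the three sample lines. This is a sketch rather than a drop-in replacement: it passes perl = TRUE to regexec (an assumption made here so that \\b and greedy matching follow PCRE rules) and builds 'lines' directly instead of reading from a connection.

```r
lines <- c("123, hi, NAME1, EMAIL1#ADDRESS.COM",
           "111, hi, NAME2, EMAIL2#ADRESS.ME",
           "699, hi, FIRST M. LAST, Jr., EMAIL4#ADDRESS.GOV")

# [^,]+ keeps the second field from swallowing a comma
pattern <- "^(\\d+), ([^,]+), (.*), \\b(.*)$"

matches <- regmatches(lines, regexec(pattern, lines, perl = TRUE))

# Drop the first column (the full match) and keep the four groups
parsed <- as.data.frame(do.call(rbind, matches)[, -1], stringsAsFactors = FALSE)
parsed
```

With the second field restricted to non-comma characters, the extra comma in the third line can only be absorbed by the third group, which is where the name belongs.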
I have a bunch of text in a dataframe (df) that usually contains three lines of an address in 1 column and my goal is to extract the district (central part of the text), eg:
73 Greenhill Gardens, Wandsworth, London
22 Acacia Heights, Lambeth, London
Fortunately for me, in 95% of cases the person inputting the data has used commas to separate the text I want, which 100% of the time ends with ", London" (i.e. comma space London). To state things clearly, my goal is therefore to extract the text BEFORE ", London" and AFTER the preceding comma.
My desired output is:
Wandsworth
Lambeth
I can manage to extract the part before:
df$extraction <- sub('.*,\\s*','',address)
and after
df$extraction <- sub('.*,\\s*','',address)
But not the middle part that I need. Can someone please help?
Many Thanks!
You could save yourself the headache of a regular expression and treat the vector like a CSV, using a file reading function to extract the relevant part. We can use read.csv(), taking advantage of the fact that colClasses can be used to drop columns.
address <- c(
"73 Greenhill Gardens, Wandsworth, London",
"22 Acacia Heights, Lambeth, London"
)
read.csv(text = address, colClasses = c("NULL", "character", "NULL"),
header = FALSE, strip.white = TRUE)[[1L]]
# [1] "Wandsworth" "Lambeth"
Or we could use fread(). Its select argument is nice and it strips white space automatically.
data.table::fread(paste(address, collapse = "\n"),
select = 2, header = FALSE)[[1L]]
# [1] "Wandsworth" "Lambeth"
Here are a couple of approaches:
# target ", London" and the start of the string
# up until the first comma followed by a space,
# and replace with ""
gsub("^.+?, |, London", "", address)
#[1] "Wandsworth" "Lambeth"
Or
# target the whole string, but use a capture group
# for the text before ", London" and after the first comma.
# replace the string with the captured group.
sub(".+, (.*), London", "\\1", address)
#[1] "Wandsworth" "Lambeth"
Here are two options that aren't dependent on the city name being the same. The first uses a regex pattern with stringr::str_extract():
raw_address <- c(
"73 Greenhill Gardens, Wandsworth, London",
"22 Acacia Heights, Lambeth, London",
"Street, District, City"
)
df <- data.frame(raw_address, stringsAsFactors = FALSE)
# trimws() drops the space that follows the first comma
df$district <- trimws(stringr::str_extract(raw_address, '(?<=,)[^,]+(?=,)'))
> df
raw_address district
1 73 Greenhill Gardens, Wandsworth, London Wandsworth
2 22 Acacia Heights, Lambeth, London Lambeth
3 Street, District, City District
The second uses strsplit() and makes getting the other elements of the address easier:
df$address <- sapply(strsplit(raw_address, ',\\s*'), `[`, 1)
df$district <- sapply(strsplit(raw_address, ',\\s*'), `[`, 2)
df$city <- sapply(strsplit(raw_address, ',\\s*'), `[`, 3)
> df
raw_address address district city
1 73 Greenhill Gardens, Wandsworth, London 73 Greenhill Gardens Wandsworth London
2 22 Acacia Heights, Lambeth, London 22 Acacia Heights Lambeth London
3 Street, District, City Street District City
The split is done on ,\\s* in case there is no space or are multiple spaces after a comma.
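The split only needs to happen once if the pieces are bound into a matrix. The following is a small sketch of that idea, assuming every address has exactly three comma-separated parts; the name 'parts' is illustrative.

```r
raw_address <- c(
  "73 Greenhill Gardens, Wandsworth, London",
  "22 Acacia Heights, Lambeth, London",
  "Street, District, City"
)

# Split once, then stack the pieces row by row.
# Assumption: each address yields exactly three pieces.
parts <- do.call(rbind, strsplit(raw_address, ",\\s*"))
colnames(parts) <- c("address", "district", "city")

df <- data.frame(raw_address, parts, stringsAsFactors = FALSE)
df$district
```

If some rows have fewer or more than three parts, rbind would recycle or misalign values, so those rows would need checking first (for example with lengths() on the strsplit result).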
You could try this:
(?<=, )(.+?),
It works with any data set; the location doesn't have to be London.
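In R, a lookbehind like this needs perl = TRUE. The sketch below is one way to apply the idea; note that it swaps the trailing "(.+?)," for "[^,]+(?=,)" (a change made here, not part of the suggestion above) so that the closing comma is not included in the match.

```r
address <- c(
  "73 Greenhill Gardens, Wandsworth, London",
  "22 Acacia Heights, Lambeth, London",
  "Street, District, City"
)

# Text preceded by ", " and followed by a comma, without either delimiter
m <- regexpr("(?<=, )[^,]+(?=,)", address, perl = TRUE)
district <- regmatches(address, m)
district
```

regmatches drops elements with no match, so for addresses that might lack a middle part it would be safer to check m != -1 before assigning the result back to a data frame column.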