Splitting a string on the first space - regex

I'd like to split a vector of character strings (people's names) into two columns (vectors). The problem is that some people have a two-word last name. I'd like to split the first and last names into two columns. I can split out the first names using the code below, but the last name eludes me. (Look at obs 29 in the sample set below to get an idea: the Ford has a "last name" of Pantera L that must be kept together.)
What I have attempted so far:
x<-rownames(mtcars)
unlist(strsplit(x, " .*"))
What I'd like it to look like:
MANUF MAKE
27 Porsche 914-2
28 Lotus Europa
29 Ford Pantera L
30 Ferrari Dino
31 Maserati Bora
32 Volvo 142E

The regular expression rexp matches the word at the start of the string, an optional space, then the rest of the string. The parentheses are capture groups, accessed as the backreferences \\1 and \\2.
rexp <- "^(\\w+)\\s?(.*)$"
y <- data.frame(MANUF=sub(rexp,"\\1",x), MAKE=sub(rexp,"\\2",x))
tail(y)
# MANUF MAKE
# 27 Porsche 914-2
# 28 Lotus Europa
# 29 Ford Pantera L
# 30 Ferrari Dino
# 31 Maserati Bora
# 32 Volvo 142E
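To see what each backreference picks out, here is a small sketch on a hand-made vector (the three strings and the names x_demo, manuf and make are made up for illustration; rexp is the pattern from above). It includes a string with no space at all, where the optional \\s? lets the second group come back empty.

```r
# Pattern from the answer above: first word, optional space, rest of the string
rexp <- "^(\\w+)\\s?(.*)$"

# Illustrative input (not the full mtcars row names)
x_demo <- c("Ford Pantera L", "Porsche 914-2", "Valiant")

manuf <- sub(rexp, "\\1", x_demo)  # first capture group
make  <- sub(rexp, "\\2", x_demo)  # second capture group

print(data.frame(MANUF = manuf, MAKE = make))
```

Because \\w+ only matches letters, digits and underscores, a first word containing other characters (a hyphen, say) would not be captured whole; (\\S+) may be a safer first group in that case.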

For me, Hadley's colsplit function in the reshape2 package is the most intuitive for this purpose. Joshua's way is more general (i.e. it can be used wherever a regex could be used) and more flexible (if you want to change the specification), but the colsplit function is perfectly suited to this specific setting:
library(reshape2)
y <- colsplit(x," ",c("MANUF","MAKE"))
tail(y)
# MANUF MAKE
#27 Porsche 914-2
#28 Lotus Europa
#29 Ford Pantera L
#30 Ferrari Dino
#31 Maserati Bora
#32 Volvo 142E

Here are two approaches:
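1) strsplit. This approach uses only functions in the core of R and no complex regular expressions. Replace the first space with a semicolon (using sub, not gsub), strsplit on the semicolon, and then rbind the pieces into a 2-column matrix. (This assumes the strings contain no semicolons. An element with no space, such as "Valiant", yields a single piece, which rbind recycles into both columns.)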
mat <- do.call("rbind", strsplit(sub(" ", ";", x), ";"))
colnames(mat) <- c("MANUF", "MAKE")
2) strapply in gsubfn package. Here is a one-liner using strapply in the gsubfn package. The two parenthesized portions of the regular expression capture the desired first and second columns respectively, and the function (specified in formula notation; it's the same as specifying function(x, y) c(MANUF = x, MAKE = y)) grabs them and adds names. The simplify = rbind argument is used to turn the result into a matrix as in the prior solution.
library(gsubfn)
mat <- strapply(x, "(\\S+)\\s+(.*)", ~ c(MANUF = x, MAKE = y), simplify = rbind)
Note: In either case a "character" matrix, mat, is returned. If a data frame of "character" columns is desired then add this:
DF <- as.data.frame(mat, stringsAsFactors = FALSE)
If "factor" columns are wanted, use stringsAsFactors = TRUE instead (before R 4.0.0 that was the default, so omitting the argument had the same effect).
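If some elements might contain no space, or might themselves contain a semicolon, a sketch that does not rely on a placeholder character is to locate the first space with regexpr and cut with substr. The helper name split_first_space and the result name mat2 are made up for this example; elements without a space get an empty second column.

```r
# Hypothetical helper: split each string at its first space only.
split_first_space <- function(x) {
  # Position of the first literal space; -1 when there is none.
  # as.integer() drops the extra attributes that regexpr() attaches.
  pos <- as.integer(regexpr(" ", x, fixed = TRUE))
  has_space <- pos > 0
  first <- ifelse(has_space, substr(x, 1, pos - 1), x)
  rest  <- ifelse(has_space, substr(x, pos + 1, nchar(x)), "")
  cbind(MANUF = first, MAKE = rest)  # character matrix, as in the answers above
}

mat2 <- split_first_space(rownames(mtcars))
tail(mat2)
```

Like the solutions above, this returns a character matrix, so the same as.data.frame step applies if a data frame is wanted.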

Yet another way of doing it:
str_split from stringr will handle the split, but returns the result in a different form (a list, like strsplit does). Manipulating it into the correct form is straightforward, though.
library(stringr)
split_x <- str_split(x, " ", 2)
(y <- data.frame(
MANUF = sapply(split_x, head, n = 1),
MAKE = sapply(split_x, tail, n = 1)
))
Or, as Hadley mentioned in the comments, with str_split_fixed.
y <- as.data.frame(str_split_fixed(x, " ", 2))
colnames(y) <- c("MANUF", "MAKE")
y

If you can do pattern and group matching, I'd try something like this (untested):
\s+(.*)\s+(.*)

You can also use tidyr::extract after first converting your vector into a data frame. I think this is also the more modern version of the older reshape2 solutions:
library(tidyr)
## first convert into a data frame
x <- data.frame(x = rownames(mtcars))
## use extract, and for example Joshua's regex
res <- extract(x, col = x, into = c("MANUF", "MAKE"), regex = "^(\\w+)\\s?(.*)$")
head(res)
#> MANUF MAKE
#> 1 Mazda RX4
#> 2 Mazda RX4 Wag
#> 3 Datsun 710
#> 4 Hornet 4 Drive
#> 5 Hornet Sportabout
#> 6 Valiant

I think searching for [^\s]+ would work. Untested.

Related

Break string into several columns using tidyr::extract regex

I'm trying to break a string vector into several variables using regular expressions in R, preferably in a dplyr-tidyr way using the tidyr::extract command. For instance, in the vector below:
sasdic <- data.frame(a=c(
'#1 ANO_CENSO 5. /*Ano do Censo*/',
'#71 TP_SEXO $Char1. /*Sexo*/',
'#72 TP_COR_RACA $Char1. /*Cor/raça*/',
'#74 FK_COD_PAIS_ORIGEM 4. /*Código País de origem*/' ))
I would like:
the first number ([0-9]+) to go to variable "int_pos"
the variable name joined by underscores ([a-zA-Z_]+) to go to variable "var_name"
the second number or the term $Char1 (could be $Char2, etc.) to go to variable "x". I figured ([0-9]+|$Char[0-9]+) could select this?
lastly, whatever comes between "/* ... */" to go to variable "label" (I don't know the regex for this).
All other intermediate characters (blank spaces, ".", "/", "*") should be ignored.
This would be the result:
d <- data.frame(int_pos=c(1,71,72,74),
var_name=c('ANO_CENSO','TP_SEXO','TP_COR_RACA','FK_COD_PAIS_ORIGEM'),
x=c('5','$Char1','$Char1','4'),
label=c('Ano do Censo','Sexo','Cor/raça','Código País de origem') )
I tried to construct a regular expression for this. This is what I got so far:
sasdic %>% extract(a, c('int_pos','var_name','x','label'),
"([0-9]+)([a-zA-Z_]+)([0-9]+|$Char[0-9]+)(something to get the label")
-> d
The regular expression above is incomplete. Also, I don't know how to make explicit, in the extract command syntax, which parts are to be captured and which parts are to be left out.
In the regex used, we match a punctuation character ([[:punct:]], i.e. the #), then capture the numeric part ((\\d+); this will be our first column of interest), followed by one or more whitespace characters (\\s+), followed by the second capture group ((\\S+): one or more non-whitespace characters, i.e. "ANO_CENSO" for the first row), followed by whitespace (\\s+). Then we capture the third group (([[:alnum:]$]+): one or more alphanumeric characters or $, so as to match $Char1). Next we match one or more characters that are not ASCII letters ([^A-Za-z]+; this gets rid of the dot, the space and the /*), and in the last part we capture one or more characters that are not * (([^*]+)).
library(tidyr)  # also makes %>% available
sasdic %>%
extract(a, into=c('int_pos', 'var_name', 'x', 'label'),
"[[:punct:]](\\d+)\\s+(\\S+)\\s+([[:alnum:]$]+)[^A-Za-z]+([^*]+)")
# int_pos var_name x label
#1 1 ANO_CENSO 5 Ano do Censo
#2 71 TP_SEXO $Char1 Sexo
#3 72 TP_COR_RACA $Char1 Cor/raça
#4 74 FK_COD_PAIS_ORIGEM 4 Código País de origem
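For comparison, here is a base R sketch that stays closer to the pattern attempted in the question. It rests on two assumptions: the dollar sign has to be escaped (\\$), since an unescaped $ is an end-of-string anchor, and the label is taken to be whatever sits between /* and */. The names sas_lines, pat and fields are made up for the example; only regexec and regmatches are used.

```r
# Two of the lines from the question
sas_lines <- c("#1 ANO_CENSO 5. /*Ano do Censo*/",
               "#71 TP_SEXO $Char1. /*Sexo*/")

# Four capture groups: position, name, format (digits or $CharN), label.
# Everything outside the parentheses is matched but not kept.
pat <- "^#([0-9]+)\\s+([A-Za-z_]+)\\s+([0-9]+|\\$Char[0-9]+)\\.\\s+/\\*(.*)\\*/$"

m <- regmatches(sas_lines, regexec(pat, sas_lines))

# Each element is c(full match, group 1, ..., group 4); drop the full match
fields <- do.call(rbind, lapply(m, `[`, -1))
colnames(fields) <- c("int_pos", "var_name", "x", "label")
fields
```

This also illustrates the "what to keep, what to leave out" part of the question: only the parenthesized pieces come back, and tidyr::extract follows the same convention with its into argument.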
This is another option, though it uses the data.table package instead of tidyr:
library(data.table)
setDT(sasdic)
# split label
sasdic[, c("V1","label") := tstrsplit(a, "/\\*|\\*/")]
# remove leading "#", split remaining parts
sasdic[, c("int_pos","var_name","x") := tstrsplit(gsub("^#","",V1)," +")]
# remove unneeded columns
sasdic[, c("a","V1") := NULL]
sasdic
# label int_pos var_name x
# 1: Ano do Censo 1 ANO_CENSO 5.
# 2: Sexo 71 TP_SEXO $Char1.
# 3: Cor/raça 72 TP_COR_RACA $Char1.
# 4: Código País de origem 74 FK_COD_PAIS_ORIGEM 4.
This assumes that the "remaining parts" (aside from the label) are space-separated.
This could also be done in one block (which is what I would do):
sasdic[, c("a","label","int_pos","var_name","x") := {
x = tstrsplit(a, "/\\*|\\*/")
x1s = tstrsplit(gsub("^#","",x[[1]])," +")
# order must match the names on the left: a (deleted), label, then the three fields
c(list(NULL), x[2], x1s)
}]
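The same two-stage idea can be sketched in base R with strsplit, here also stripping the trailing dot from x. The input lines are copied from the question; the intermediate names (parts, head_part, fields, out) are made up.

```r
a <- c("#1 ANO_CENSO 5. /*Ano do Censo*/",
       "#71 TP_SEXO $Char1. /*Sexo*/")

# Stage 1: cut at the comment delimiters; piece 1 is the head, piece 2 the label
parts     <- strsplit(a, "/\\*|\\*/")
head_part <- vapply(parts, `[`, character(1), 1)
label     <- vapply(parts, `[`, character(1), 2)

# Stage 2: drop the leading "#" and split the head on runs of spaces
fields <- do.call(rbind, strsplit(sub("^#", "", head_part), " +"))

out <- data.frame(int_pos  = as.integer(fields[, 1]),
                  var_name = fields[, 2],
                  x        = sub("\\.$", "", fields[, 3]),  # strip trailing dot
                  label    = label,
                  stringsAsFactors = FALSE)
out
```

As with the data.table version, this assumes the parts before the label are separated by spaces and that every line has the same number of fields.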
You could use the unglue package:
library(unglue)
unglue_unnest(sasdic, a, "#{int_pos}{=\\s+}{varname}{=\\s+}{x}.{=\\s+}/*{label}*/")
#> int_pos varname x label
#> 1 1 ANO_CENSO 5 Ano do Censo
#> 2 71 TP_SEXO $Char1 Sexo
#> 3 72 TP_COR_RACA $Char1 Cor/raça
#> 4 74 FK_COD_PAIS_ORIGEM 4 Código País de origem

Substring extraction from vector in R

I am trying to extract substrings from an unstructured text. For example, assume a vector of country names:
countries <- c("United States", "Israel", "Canada")
How do I go about passing this vector of character values to extract exact matches from unstructured text?
text.df <- data.frame(ID = c(1:5),
text = c("United States is a match", "Not a match", "Not a match",
"Israel is a match", "Canada is a match"))
In this example, the desired output would be:
ID text
1 United States
4 Israel
5 Canada
So far I have been working with gsub, removing all non-matches and then dropping the rows with empty values. I have also been working with str_extract from the stringr package, but haven't had success getting the arguments for the regular expression correct. Any assistance would be greatly appreciated!
1. stringr
We could first subset 'text.df' using 'indx' (formed by collapsing the 'countries' vector with "|") as the pattern in 'grep', then use 'str_extract' to get the matching elements from the 'text' column and assign them to the 'text' column of the subset dataset ('text.df1'):
library(stringr)
indx <- paste(countries, collapse="|")
text.df1 <- text.df[grep(indx, text.df$text),]
text.df1$text <- str_extract(text.df1$text, indx)
text.df1
# ID text
#1 1 United States
#4 4 Israel
#5 5 Canada
2. base R
Without using any external packages, we can extract the substrings that match 'indx' with 'regmatches' and 'gregexpr' (here 'text.df1' is the subset created with 'grep' above):
text.df1$text <- unlist(regmatches(text.df1$text,
gregexpr(indx, text.df1$text)))
3. stringi
We could also use the faster stri_extract from stringi
library(stringi)
na.omit(within(text.df, text1<- stri_extract(text, regex=indx)))[-2]
# ID text1
#1 1 United States
#4 4 Israel
#5 5 Canada
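A self-contained base R variant of the same idea, as a sketch: regexpr finds the first match in each row and regmatches returns only the rows that matched, so the subsetting and the extraction can share one match object. It assumes the country names contain no regex metacharacters and that one match per row is enough; the names pattern, m, hit and result are made up.

```r
countries <- c("United States", "Israel", "Canada")
text.df <- data.frame(ID = 1:5,
                      text = c("United States is a match", "Not a match",
                               "Not a match", "Israel is a match",
                               "Canada is a match"),
                      stringsAsFactors = FALSE)

# Alternation of all country names: "United States|Israel|Canada"
pattern <- paste(countries, collapse = "|")

m   <- regexpr(pattern, text.df$text)   # first match per row, -1 if none
hit <- as.integer(m) > 0                # rows that contain a country

# regmatches() drops the non-matching rows, so it lines up with ID[hit]
result <- data.frame(ID = text.df$ID[hit],
                     text = regmatches(text.df$text, m),
                     stringsAsFactors = FALSE)
result
```

Unlike the in-place versions above, this leaves text.df untouched and builds a new data frame.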
Here's an approach with data.table:
library(data.table)
data.table(text.df)[
sapply(countries, function(x) grep(x, text), USE.NAMES = FALSE),
list(ID, text = countries)]
# ID text
# 1: 1 United States
# 2: 4 Israel
# 3: 5 Canada
Create the pattern, p, and use strapply to extract the match to each component of text returning NA for each unmatched component. Finally remove the NA values using na.omit. This is non-destructive (i.e. text.df is not modified):
library(gsubfn)
p <- paste(countries, collapse = "|")
na.omit(transform(text.df, text = strapply(paste(text), p, empty = NA, simplify = TRUE)))
giving:
ID text
1 1 United States
4 4 Israel
5 5 Canada
Using dplyr it could also be written as follows (using p from above):
library(dplyr)
library(gsubfn)
text.df %>%
mutate(text = strapply(paste(text), p, empty = NA, simplify = TRUE)) %>%
na.omit

regex variable substitution in "replacement" argument

I have a string in R. I want to find part of the string and append a variable number of zeroes. For example, I have 1 2 3. Sometimes I want it to be 1 20 3; sometimes I want it to be 1 2000 3. If I store the number of appended zeroes in a variable, how can I use it in the "replacement" part of a sub command?
I have in mind code like this:
s <- '1 2 3'
z <- '3'
sub('(\\s\\d)(\\s.*)', '\\10{z}\\2', s)
This code returns 1 20{z} 3. But I want 1 2000 3. How can I get this sort of result?
One way is
s <- '1 2 3'
z <- '3'
zx <- paste(rep(0, z), collapse = '')
sub('(\\s\\d)(\\s.*)', paste0('\\1', zx, '\\2'), s)
but this is a little clunky.
Try the concatenation operator from the stringi package:
require(stringi)
"abc"%stri+%"123abc"
## [1] "abc123abc"
Your approach of creating the replacement string zx is pretty good. However, you can improve your sub command. If you use lookbehind and lookahead instead of capture groups, you don't need to build a new replacement string around the backreferences. You can use zx directly.
sub("(?<=\\s\\d)(?=\\s)", zx, s, perl = TRUE)
# [1] "1 2000 3"
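If this is needed more than once, the two steps could be wrapped in a small function. This is only a sketch: append_zeros is a made-up name, it reuses the capture-group pattern from the question, and it builds the run of zeros with base R's strrep instead of paste(rep(...)).

```r
# Hypothetical helper: append n zeros to the second number in the string.
append_zeros <- function(s, n) {
  zeros <- strrep("0", as.integer(n))   # e.g. n = 3 gives "000"
  # Replacement is: group 1, the zeros, group 2.
  # Backreferences are single-digit, so "\\1" followed by "000" is unambiguous.
  sub("(\\s\\d)(\\s.*)", paste0("\\1", zeros, "\\2"), s)
}

append_zeros("1 2 3", 3)
append_zeros("1 2 3", 1)
```

The as.integer call means n can be given as a number or as a string such as '3', as in the question.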

How to extract a part from a string in R

I have a problem obtaining the numeric parts of a string in R. The original string is, for example, "buy 1000 shares of Google at 1100 GBP".
I need to extract the number of shares (1000) and the price (1100) separately. Besides, I need to extract the name of the stock, which always appears after "shares of".
I know that sub and gsub can replace parts of a string, but what commands should I use to extract part of a string?
1) This extracts all numbers in order:
s <- "buy 1000 shares of Google at 1100 GBP"
library(gsubfn)
strapplyc(s, "[0-9.]+", simplify = as.numeric)
giving:
[1] 1000 1100
2) If the numbers can be in any order but if the number of shares is always followed by the word "shares" and the price is always followed by GBP then:
strapplyc(s, "(\\d+) shares", simplify = as.numeric) # 1000
strapplyc(s, "([0-9.]+) GBP", simplify = as.numeric) # 1100
The portion of the string matched by the part of the regular expression within parens is returned.
3) If the string is known to be of the form: X shares of Y at Z GBP then X, Y and Z can be extracted like this:
strapplyc(s, "(\\d+) shares of (.+) at ([0-9.]+) GBP", simplify = c)
ADDED: Modified the pattern to allow either digits or a dot. Also added (3) above and the following:
strapply(c(s, s), "[0-9.]+", as.numeric)
strapply(c(s, s), "[0-9.]+", as.numeric, simplify = rbind) # if each element has the same number of matches
strapply(c(s, s), "(\\d+) shares", as.numeric, simplify = c)
strapply(c(s, s), "([0-9.]+) GBP", as.numeric, simplify = c)
strapplyc(c(s, s), "(\\d+) shares of (.+) at ([0-9.]+) GBP")
strapplyc(c(s, s), "(\\d+) shares of (.+) at ([0-9.]+) GBP", simplify = rbind)
You can use the sub function:
s <- "buy 1000 shares of Google at 1100 GBP"
# the number of shares
sub(".* (\\d+) shares.*", "\\1", s)
# [1] "1000"
# the stock
sub(".*shares of (\\w+) .*", "\\1", s)
# [1] "Google"
# the price
sub(".* at (\\d+) .*", "\\1", s)
# [1] "1100"
You can also use gregexpr and regmatches to extract all substrings at once:
regmatches(s, gregexpr("\\d+(?= shares)|(?<=shares of )\\w+|(?<= at )\\d+",
s, perl = TRUE))
# [[1]]
# [1] "1000" "Google" "1100"
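When several such strings need to end up in typed columns, strcapture from the utils package (shipped with recent versions of R) may be convenient: it applies one pattern with capture groups and converts each group to the type given in a prototype data frame. A sketch, where the second order string and the names orders, proto and res are made up:

```r
orders <- c("buy 1000 shares of Google at 1100 GBP",
            "buy 250 shares of Apple at 98.5 GBP")   # second string is made up

# The prototype fixes the column names and types for the three capture groups
proto <- data.frame(shares = numeric(), stock = character(), price = numeric(),
                    stringsAsFactors = FALSE)

res <- strcapture("(\\d+) shares of (.+) at ([0-9.]+) GBP", orders, proto)
res
```

This assumes every string has the form "X shares of Y at Z GBP"; the number of capture groups in the pattern has to equal the number of columns in proto.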
I feel compelled to include the obligatory stringr solution as well.
library(stringr)
s <- "buy 1000 shares of Google at 1100 GBP"
str_match(s, "([0-9]+) shares")[2]
# [1] "1000"
str_match(s, "([0-9]+) GBP")[2]
# [1] "1100"
If you want to extract all digits from text, use this function from the stringi package.
"Nd" is the class of decimal digits.
library(stringi)
stri_extract_all_charclass(c(123,43,"66ala123","kot"),"\\p{Nd}")
# [[1]]
# [1] "123"
# [[2]]
# [1] "43"
# [[3]]
# [1] "66" "123"
# [[4]]
# [1] NA
Please note that here the numbers 66 and 123 are extracted separately.
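For reference, a base R sketch of the same extraction uses gregexpr with regmatches. The pattern [0-9]+ only covers ASCII digits, which is narrower than the Unicode class Nd, and the name digit_runs is made up.

```r
x <- c(123, 43, "66ala123", "kot")   # c() coerces everything to character

# One list element per input string, holding every run of ASCII digits found
digit_runs <- regmatches(x, gregexpr("[0-9]+", x))
digit_runs
```

One difference from the stringi call: an element with no digits comes back as character(0) rather than NA.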