Comparing two versions of the same string - regex

I would like to write a function that compares two strings in R. More precisely, if I have this data:
data <- list(
"First sentence.",
"Very first sentence.",
"Very first and only one sentences."
)
I would like the output to be :
[1] "Very" " and only one sentences"
My output is built from every substring that is not included in the previous string. For example:
2nd vs 1st, remove matching string - "first sentence." - from the 2nd, so result is "Very".
# "First sentence."
# "Very first sentence."
# match: ^^^^^^^^^^^^^^^
Now compare 3rd vs 2nd, remove matching string - "very first" - from the 3rd, so result is " and only one sentences".
# "Very first sentence."
# "Very first and only one sentences."
# match: ^^^^^^^^^^
Then compare 4th vs 3rd, etc...
So based on this example my output should be:
c("Very", " and only one sentences")
# [1] "Very" " and only one sentences"

Here's a tidyverse approach:
library(dplyr)
library(tidyr)
# put data in a tibble (data_frame() is deprecated)
tibble(string = unlist(data)) %>%
# add ID column so we can recombine later (add_rownames() is deprecated)
tibble::rownames_to_column('id') %>%
# add a lagged column to compare against
mutate(string2 = lag(string)) %>%
# break strings into words
separate_rows(string) %>%
# evaluate the following calls rowwise (until regrouped)
rowwise() %>%
# chop to rows with a string to compare against,
filter(!is.na(string2),
# where the word is not in the comparison string
!grepl(string, string2, ignore.case = TRUE)) %>%
# regroup by ID
group_by(id) %>%
# reassemble strings
summarise(string = paste(string, collapse = ' '))
## # A tibble: 2 x 2
## id string
## <chr> <chr>
## 1 2 Very
## 2 3 and only one sentences.
Select out string if you'd like just a vector by appending
...
%>% `[[`('string')
## [1] "Very" "and only one sentences."
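For comparison, here is a minimal base R sketch of the same idea. It assumes that "matching" means exact, case-insensitive word matches against the previous string, which is slightly stricter than the grepl substring test above; the names words and new_words are only illustrative.

```r
data <- list(
  "First sentence.",
  "Very first sentence.",
  "Very first and only one sentences."
)

# split every string into whitespace-separated words
words <- strsplit(unlist(data), "\\s+")

# for each string, keep the words that do not occur in the previous string
new_words <- mapply(function(cur, prev) {
  keep <- !(tolower(cur) %in% tolower(prev))
  paste(cur[keep], collapse = " ")
}, words[-1], words[-length(words)], USE.NAMES = FALSE)

new_words
# [1] "Very"                    "and only one sentences."
```

As in the tidyverse version, the leading space and the missing final period of the desired output are not reproduced.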


Split words in R Dataframe column

I have a data frame with words in a column separated by single space. I want to split it into three types as below. Data frame looks as below.
Text
one of the
i want to
I want to split it as shown below.
Text        split1  split2  split3
one of the  one     one of  of the
I am able to get split1, but I cannot figure out the other two.
My code to get split1:
new_data$split1<-sub(" .*","",new_data$Text)
Figured out the split2:
df$split2 <- gsub(" [^ ]*$", "", df$Text)
We can try with gsub. Capture one or more non-white space (\\S+) as a group (in this case there are 3 words), then in the replacement, we rearrange the backreference and insert a delimiter (,) which we use for converting to different columns with read.table.
df1[paste0("split", 1:3)] <- read.table(text=gsub("(\\S+)\\s+(\\S+)\\s+(\\S+)",
"\\1,\\1 \\2,\\2 \\3", df1$Text), sep=",")
df1
# Text split1 split2 split3
#1 one of the one one of of the
#2 i want to i i want want to
data
df1 <- structure(list(Text = c("one of the", "i want to")),
.Names = "Text", class = "data.frame", row.names = c(NA, -2L))
There might be more elegant solutions. Here are two options:
Using ngrams:
library(dplyr); library(tm)
df %>% mutate(splits = strsplit(Text, "\\s+")) %>%
mutate(split1 = lapply(splits, `[`, 1)) %>%
mutate(split2 = lapply(splits, function(words) ngrams(words, 2)[[1]]),
split3 = lapply(splits, function(words) ngrams(words, 2)[[2]])) %>%
select(-splits)
Text split1 split2 split3
1 one of the one one, of of, the
2 i want to i i, want want, to
Extract the two grams manually:
df %>% mutate(splits = strsplit(Text, "\\s+")) %>%
mutate(split1 = lapply(splits, `[`, 1)) %>%
mutate(split2 = lapply(splits, `[`, 1:2),
split3 = lapply(splits, `[`, 2:3)) %>%
select(-splits)
Text split1 split2 split3
1 one of the one one, of of, the
2 i want to i i, want want, to
Update:
With a regular expression, we can use the backreferences of gsub.
Split2:
gsub("((.*)\\s+(.*))\\s+(.*)", "\\1", df$Text)
[1] "one of" "i want"
Split3:
gsub("(.*)\\s+((.*)\\s+(.*))", "\\2", df$Text)
[1] "of the" "want to"
This is a somewhat hackish solution.
Assumption: you are not concerned about the number of spaces between two words.
> library(stringr)
> x<-c('one of the','i want to')
> strsplit(gsub('(\\S+)\\s+(\\S+)\\s+(.*)', '\\1  \\1 \\2  \\2 \\3', x), '\\s\\s+')
#[[1]]
#[1] "one" "one of" "of the"
#[[2]]
#[1] "i" "i want" "want to"
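For comparison, a minimal base R sketch that returns plain character columns rather than list columns. It assumes every row has exactly three space-separated words, as in the example; join_words is a hypothetical helper introduced here for illustration.

```r
df1 <- data.frame(Text = c("one of the", "i want to"),
                  stringsAsFactors = FALSE)

# one character vector of words per row
words <- strsplit(df1$Text, "\\s+")

# hypothetical helper: paste the words at positions idx back together
join_words <- function(w, idx) paste(w[idx], collapse = " ")

df1$split1 <- vapply(words, join_words, character(1), idx = 1)
df1$split2 <- vapply(words, join_words, character(1), idx = 1:2)
df1$split3 <- vapply(words, join_words, character(1), idx = 2:3)
df1
```

Using vapply with character(1) makes each new column an ordinary character vector, which is usually easier to work with than a list column.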

Truncate words within each element of a character vector in R

I have a data frame where one column is a character vector and every element in the vector is the full text of a document. I want to truncate words in each element so that maximum word length is 5 characters.
For example:
a <- c(1, 2)
b <- c("Words longer than five characters should be truncated",
"Words shorter than five characters should not be modified")
df <- data.frame("file" = a, "text" = b, stringsAsFactors=FALSE)
head(df)
file text
1 1 Words longer than five characters should be truncated
2 2 Words shorter than five characters should not be modified
And this is what I'm trying to get:
file text
1 1 Words longe than five chara shoul be trunc
2 2 Words short than five chara shoul not be modif
I've tried using strsplit() and strtrim() to modify each word (based in part on split vectors of words by every n words (vectors are in a list)):
x <- unlist(strsplit(df$text, "\\s+"))
y <- strtrim(x, 5)
y
[1] "Words" "longe" "than" "five" "chara" "shoul" "be" "trunc" "Words" "short" "than"
[12] "five" "chara" "shoul" "not" "be" "modif"
But I don't know if that's the right direction, because I ultimately need the words in a data frame associated with the correct row, as shown above.
Is there a way to do this using gsub and regex?
If you're looking to utilize gsub to perform this task:
> df$text <- gsub('(?=\\b\\pL{6,}).{5}\\K\\pL*', '', df$text, perl=T)
> df
# file text
# 1 1 Words longe than five chara shoul be trunc
# 2 2 Words short than five chara shoul not be modif
You were on the right track. For your idea to work, however, you have to do the split/trim/combine for each row separately. Here's a way to do it. I was deliberately verbose to make it clear, but you can obviously use fewer lines.
df$text <- sapply(df$text, function(str) {
str <- unlist(strsplit(str, " "))
str <- strtrim(str, 5)
str <- paste(str, collapse = " ")
str
})
And the output:
> df
file text
1 1 Words longe than five chara shoul be trunc
2 2 Words short than five chara shoul not be modif
The short version is
df$text <- sapply(df$text, function(str) {
paste(strtrim(unlist(strsplit(str, " ")), 5), collapse = " ")
})
Edit:
I just realized you asked if it is possible to do this using gsub and regex. Even though you don't need those for this, it's still possible, but harder to read:
df$text <- sapply(df$text, function(str) {
str <- unlist(strsplit(str, " "))
str <- gsub("(?<=.{5}).+", "", str, perl = TRUE)
str <- paste(str, collapse = " ")
str
})
The regex matches anything that appears after 5 characters and replaces those with nothing. perl = TRUE is necessary to enable the regex lookbehind ((?<=.{5})).
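A vectorised alternative, sketched under the assumption that a "word" is a run of \\w characters (letters, digits, underscore): a single gsub call with a capture group keeps the first five characters of each longer word, so no per-row sapply is needed. The names text and truncated are illustrative.

```r
text <- c("Words longer than five characters should be truncated",
          "Words shorter than five characters should not be modified")

# (\\w{5}) captures the first five word characters;
# \\w+ consumes the rest of the word, which is dropped by the replacement
truncated <- gsub("(\\w{5})\\w+", "\\1", text)
truncated
# [1] "Words longe than five chara shoul be trunc"
# [2] "Words short than five chara shoul not be modif"
```

Words of five characters or fewer never match, because the pattern needs at least six consecutive word characters.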

Eliminating the characters that are not a date in R

I have a data frame with a column of dates in the following format:
pv$day
01/01/13 00:00:00
03/01/13 00:02:03
04/03/13 00:10:15
....
I would like to eliminate the timestamp, just leaving the date (e.g. 01/01/13 for the first row). I have tried both using sapply() to apply the strsplit() function, and tried to filter the content using a regex, but don't seem to have quite gotten it right in either case. This:
sapply(pv$day, function(x) strsplit(toString(x), ' '))
gives me the column with the correct split, but indexing with either [1] or [[1]] does not return the first element of the split.
What is the best way to go about this?
You can use sub:
vec <- c("01/01/13 00:00:00", "03/01/13 00:02:03", "04/03/13 00:10:15")
sub(" .+", "", vec)
# [1] "01/01/13" "03/01/13" "04/03/13"
A simple, flexible solution is to use strptime and strftime. Here is an example that uses your dates from the example above:
# Your dates
t <- c("01/01/13 00:00:00","03/01/13 00:02:03", "04/03/13 00:10:15")
# Convert character strings to dates
z <- strptime(t, "%d/%m/%y %H:%M:%OS")
# Convert dates to string, omitting the time
z.date <- strftime(z,"%d/%m/%y")
# Print the first date
z.date[1]
Here's a nice way to use sapply; it uses strsplit to split at the space and keeps the first element of each split.
> d <- c("01/01/13 00:00:00", "03/01/13 00:02:03", "04/03/13 00:10:15")
> sapply(strsplit(d, " "), `[`, 1)
# [1] "01/01/13" "03/01/13" "04/03/13"
And also, you could use stringr::word if you just want a character vector.
> library(stringr)
> word(d)
# [1] "01/01/13" "03/01/13" "04/03/13"
Here is an approach using a look around assertion:
vec <- c("01/01/13 00:00:00", "03/01/13 00:02:03", "04/03/13 00:10:15")
gsub(pattern = "(?=00).*$", replacement = "", vec, perl = TRUE)
[1] "01/01/13 " "03/01/13 " "04/03/13 "
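The pattern matches from the first position that is followed by "00" through the end of the string and removes it. Note that the space before the time is kept, and that this only returns the date when the first "00" in the string is the start of the time part.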
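If the values are real dates, another option is to parse them rather than cut the string. The sketch below assumes the day/month/year order used in the strptime answer above; as.Date ignores the trailing time once the format has been consumed.

```r
vec <- c("01/01/13 00:00:00", "03/01/13 00:02:03", "04/03/13 00:10:15")

# parse only the date part; the format is an assumption (day/month/2-digit year)
dates <- as.Date(vec, format = "%d/%m/%y")

# back to the original textual layout, without the timestamp
date_strings <- format(dates, "%d/%m/%y")
date_strings
# [1] "01/01/13" "03/01/13" "04/03/13"
```

Keeping the dates vector as class Date also allows sorting and date arithmetic, which the pure string approaches do not.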

R Conditional Replace/Trim with Fill (regex,gsub,gregexpr,regmatches)

I have a question involving conditional replace.
I essentially want to find every string of numbers and, for every consecutive digit after 4, replace it with a space.
I need the solution to be vectorized and speed is essential.
Here is a working (but inefficient) solution:
data <- data.frame(matrix(NA, ncol=2, nrow=6, dimnames=list(c(), c("input","output"))),
stringsAsFactors=FALSE)
data[1,] <- c("STRING WITH 2 FIX(ES): 123456 098765 1111 ",NA)
data[2,] <- c(" PADDED STRING WITH 3 FIX(ES): 123456 098765 111111 ",NA)
data[3,] <- c(" STRING WITH 0 FIX(ES): 12 098 111 ",NA)
data[4,] <- c(NA,NA)
data[5,] <- c("1234567890",NA)
data[6,] <- c(" 12345 67890 ",NA)
x2 <- data[,"input"]
x2
p1 <- "([0-9]+)"
m1 <- gregexpr(p1, x2,perl = TRUE)
nchar1 <- lapply(regmatches(x2, m1), function(x){
if (length(x)==0){ x <- NA } else ( x <- nchar(x))
return(x) })
x3 <- mapply(function(match,length,text,cutoff) {
temp_comb <- data.frame(match=match, length=length, stringsAsFactors=FALSE)
for(i in which(temp_comb[,"length"] > cutoff))
{
before <- substr(text, 1, (temp_comb[i,"match"]-1))
middle_4 <- substr(text, temp_comb[i,"match"], temp_comb[i,"match"]+cutoff-1)
middle_space <- paste(rep(" ", temp_comb[i,"length"]-cutoff),sep="",collapse="")
after <- substr(text, temp_comb[i,"match"]+temp_comb[i,"length"], nchar(text))
text <- paste(before,middle_4,middle_space,after,sep="")
}
return(text)
},match=m1,length=nchar1,text=x2,cutoff=4)
data[,"output"] <- x3
Is there a better way?
I was looking at the help section for regmatches and there was a similar type question, but it was full replacement with blanks and not conditional.
I would have written some alternatives and benchmarked them but honestly I couldn't think of other ways to do this.
Thanks ahead of time for the help!
UPDATE
Fleck,
Using your way but making cutoff an input, I am getting an error for the NA case:
#replace numbers after the 4th with spaces for those matches
zz<-lapply(regmatches(data$input, m), function(x,cutoff) {
# x <- regmatches(data$input, m)[[4]]
# cutoff <- 4
mapply(function(x, n, cutoff){
formatC(substr(x,1,cutoff), width=-n)
}, x=x, n=nchar(x),cutoff=cutoff)
},cutoff=4)
Here's a fast approach with just one gsub command:
gsub("(?<!\\d)(\\d{4})\\d*", "\\1", data$input, perl = TRUE)
# [1] "STRING WITH 2 FIX(ES): 1234 0987 1111 "
# [2] " PADDED STRING WITH 3 FIX(ES): 1234 0987 1111 "
# [3] " STRING WITH 0 FIX(ES): 12 098 111 "
# [4] NA
# [5] "1234"
# [6] " 1234 6789 "
The string (?<!\\d) is a negative lookbehind: a position that is not preceded by a digit. The string (\\d{4}) means 4 consecutive digits. Finally, \\d* represents any number of digits. The part of the string that matches this regex is replaced by the first group (the first 4 digits).
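A small illustration of that pattern on made-up input (the vector x is hypothetical): runs of fewer than four digits are left alone, and longer runs are cut to their first four digits.

```r
x <- c("id 123456 x", "12 098 111", "1234567890")

# keep the first four digits of every digit run, drop the remaining digits
trimmed <- gsub("(?<!\\d)(\\d{4})\\d*", "\\1", x, perl = TRUE)
trimmed
# [1] "id 1234 x"  "12 098 111" "1234"
```

perl = TRUE is required because lookbehind assertions are not supported by the default regex engine.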
An approach that does not change string length:
matches <- gregexpr("(?<=\\d{4})\\d+", data$input, perl = TRUE)
mapply(function(m, d) {
# m is NA for an NA input and -1 when nothing matched
if (!anyNA(m) && m[1L] != -1L) {
len <- attr(m, "match.length")
for (i in seq_along(m)) {
# overwrite match i with the same number of spaces
substr(d, m[i], m[i] + len[i] - 1L) <- strrep(" ", len[i])
}
}
return(d)
}, matches, data$input)
# [1] "STRING WITH 2 FIX(ES): 1234 0987 1111 "
# [2] " PADDED STRING WITH 3 FIX(ES): 1234 0987 1111 "
# [3] " STRING WITH 0 FIX(ES): 12 098 111 "
# [4] NA
# [5] "1234 "
# [6] " 1234 6789 "
You can do the same in one line (and one space for one digit) with:
gsub("(?:\\G(?!\\A)|\\d{4})\\K\\d", " ", data$input, perl = TRUE)
details:
(?: # non-capturing group: the two possible entry points
\G # either the position after the last match or the start of the string
(?!\A) # exclude the start of the string position
| # OR
\d{4} # four digits
) # close the non-capturing group
\K # removes all on the left from the match result
\d # a single digit
Here's a way with gregexpr and regmatches
#find all numbers with more than 4 digits
m <- gregexpr("\\d{5,}", data$input)
#replace numbers after the 4th with spaces for those matches
zz<-lapply(regmatches(data$input, m), function(x) {
mapply(function(x, n) formatC(substr(x,1,4), width=-n), x, nchar(x))
})
#combine with original values
data$output2 <- unlist(Map(function(a,b) paste0(a,c(b,""), collapse=""),
regmatches(data$input, m, invert=T), zz))
The difference here is that it turns the NA value into "". We could add in other checks to prevent that or just turn all zero length strings into missing values at the end. I just didn't want to over-complicate the code with safety checks.
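The same gregexpr/regmatches idea can also be written with the replacement form of regmatches, which avoids reassembling the pieces by hand. This is only a sketch on made-up input without NA values (NA handling is deliberately left out), and it relies on strrep, available in R 3.3.0 and later.

```r
x <- c("ID 123456 and 12", "1234567890")

# locate every run of five or more digits
m <- gregexpr("[0-9]{5,}", x)

# replace each run by its first four digits padded with spaces to the same width
regmatches(x, m) <- lapply(regmatches(x, m), function(num) {
  paste0(substr(num, 1, 4), strrep(" ", nchar(num) - 4))
})
x
# [1] "ID 1234   and 12" "1234      "
```

Because every replacement has the same width as the run it replaces, the string lengths stay unchanged.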

Better Strategy for pulling elements from string

I have a string that looks like this:
x <- "\r\n Ticker Symbol: RBO\r\n \t Exchange: TSX \r\n\t Assets ($mm) 36.26 \r\n\t Units Outstanding: 1,800,000 \r\n\t Mgmt. Fee** 0.25 \r\n 2013 MER* n/a \r\n\t CUSIP: 74932K103"
What I need is this:
list(Ticker = "RBO", Assets = 36.26, Shares = 1,800,000)
I've tried splitting, regex, etc. But I feel my string manipulation skills are not up to snuff.
Here's my "best" attempt so far.
x <- unlist(strsplit(unlist(strsplit(x, "\r\n\t") ),"\r\n"))
trim <- function (x) gsub("^\\s+|\\s+$", "", x)
x <- trim(x)
gsub("[A-Z]+$","\\2",x[2]) # bad attempt to get RBO
Update/better answer:
A look at cat(x) and readLines(textConnection(x)) helps a lot here
> cat(x)
#
# Ticker Symbol: RBO
# Exchange: TSX
# Assets ($mm) 36.26 #
# Units Outstanding: 1,800,000
# Mgmt. Fee** 0.25
# 2013 MER* n/a
# CUSIP: 74932K103
> readLines(textConnection(x))
# [1] "" " Ticker Symbol: RBO"
# [3] " \t Exchange: TSX " "\t Assets ($mm) 36.26 "
# [5] "\t Units Outstanding: 1,800,000 " "\t Mgmt. Fee** 0.25 "
# [7] " 2013 MER* n/a " "\t CUSIP: 74932K103"
Now we know a few things. One, we don't need the first line, and we do want the second line. That makes things easier because now the first line matches our desired first line. Next, it would be easier if your list names matched the names in the string. I chose these.
> nm <- c("Symbol", "Assets", "Units")
Now all we have to do is use grep with sapply, and we'll get back a named vector of matches. Setting value = TRUE in grep will return us the strings.
> (y <- sapply(nm, grep, x = readLines(textConnection(x))[-1], value = TRUE))
# Symbol Assets
# " Ticker Symbol: RBO" "\t Assets ($mm) 36.26 "
# Units
# "\t Units Outstanding: 1,800,000 "
Then we strsplit that on "[: ]", take the last element in each split, and we're done.
> lapply(strsplit(y, "[: ]"), tail, 1)
$Symbol
[1] "RBO"
$Assets
[1] "36.26"
$Units
[1] "1,800,000"
You could achieve the same result with
> g <- gsub("[[:cntrl:]]", "", capture.output(cat(x))[-1])
> m <- mapply(grep, nm, MoreArgs = list(x = g, value = TRUE))
> lapply(strsplit(m, "[: ]"), tail, 1)
Hope that helps.
Original Answer:
It looks like if you're pulling these from a large table, that they'd all be in the same element "slot" each time, so maybe this might be a little easier.
> s <- strsplit(x, "[: ]|[[:cntrl:]]")[[1]]
Explained:
- [: ] a character class that matches either a ":" character or a space character
- | or
- [[:cntrl:]] any control character, which in this case is any of \r, \t, and \n. This is probably better explained here
Then, nzchar looks in the above result for non-zero length character strings, and returns TRUE if matched, FALSE otherwise. So we can look at the result of the first line, determine where the matches are, and subset based on that.
> as.list(s[nzchar(s)][c(3, 8, 11)])
[[1]]
[1] "RBO"
[[2]]
[1] "36.26"
[[3]]
[1] "1,800,000"
You could put this into one line by assigning s as the inner call. Since functions and calls are evaluated from the inside out, s is assigned before R reaches the outside s subset. This is a bit less readable though.
s[nzchar(s <- strsplit(x, "[: ]|[[:cntrl:]]")[[1]])][c(3,8,11)]
So this would go s <- strsplit(...) -> [[ -> nzchar -> s[.. -> [c(3,8,11)]
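To see what the split pattern and nzchar do, here is a small sketch on a shortened, made-up input (the string and the names pieces and tokens are illustrative). Each ":" , space or control character is a separate split point, so adjacent separators leave empty strings behind, which nzchar then filters out.

```r
# shortened input: one label/value pair followed by a line break
pieces <- strsplit("Ticker Symbol: RBO\r\n", "[: ]|[[:cntrl:]]")[[1]]

# adjacent separators such as ": " or "\r\n" produce "" entries in pieces;
# nzchar() is TRUE only for the non-empty ones
tokens <- pieces[nzchar(pieces)]
tokens
# [1] "Ticker" "Symbol" "RBO"
```

The fixed positions c(3, 8, 11) used above only work as long as every input has the same layout, since they index into this filtered token vector.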
Perhaps:
sub( "\\\r\\\n.+$", "", sub( "^.+Ticker Symbol: ", "", x) )
[1] "RBO"
I suppose you might do it all in one pattern with parentheses and a backreference.
> sub( "^.+Ticker Symbol: ([[:alpha:]]{1,})\\\r\\\n.+$", "\\1", x)
[1] "RBO"
If you just want to extract different parts of the string, you can use regexpr to find phrases and extract the contents after the phrase. For example
extr<-list(
"Ticker" = "Ticker Symbol: ",
"Assets" = "Assets ($mm) ",
"Shares" = "Units Outstanding: "
)
lines<-strsplit(x,"\r\n")[[1]]
Map(function(p) {
m <- regexpr(p, lines, fixed=TRUE)
if(length( w<- which(m!=-1))==1) {
gsub("^\\s+|\\s+$", "",
substr(lines[w], m[w] + attr(m,"match.length")[w], nchar(lines[w])))
} else {
NA
}
}, extr)
Which returns the named list as desired
$Ticker
[1] "RBO"
$Assets
[1] "36.26"
$Shares
[1] "1,800,000"
Here extr is a list where the name of the element is the name that will be used in the final list, and the element value is the exact string that will be matched in the text. I added in a gsub as well to trim off any whitespace.
The stringr package is good for scraping data from strings. Here are the steps I use every time. You can always make the rules as specific or robust as you see fit.
require(stringr)
## take out annoying characters
x <- gsub("\r\n", "", x)
x <- gsub("\t", "", x)
x <- gsub("\\(\\$mm\\) ", "", x)
## define character index positions of interest
tickerEnd <- str_locate(x, "Ticker Symbol: ")[[1, "end"]]
assetsEnd <- str_locate(x, "Assets ")[[1, "end"]]
unitsStart <- str_locate(x, "Units Outstanding: ")[[1, "start"]]
unitsEnd <- str_locate(x, "Units Outstanding: ")[[1, "end"]]
mgmtStart <- str_locate(x, "Mgmt")[[1, "start"]]
## get substrings based on indices
tickerTxt <- substr(x, tickerEnd + 1, tickerEnd + 4) # allows 4-character symbols
assetsTxt <- substr(x, assetsEnd + 1, unitsStart - 1)
sharesTxt <- substr(x, unitsEnd + 1, mgmtStart - 1)
## cut out extraneous characters
ticker <- gsub(" ", "", tickerTxt)
assets <- gsub(" ", "", assetsTxt)
shares <- gsub(" |,", "", sharesTxt)
## add data to data frame
df <- data.frame(ticker, as.numeric(assets), as.numeric(shares), stringsAsFactors = FALSE)
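Pulling the ideas from the answers together, here is a minimal base R sketch that returns the list asked for in the question. It assumes each value is the first whitespace-delimited token after a fixed label; value_after is a hypothetical helper introduced here, not part of any package.

```r
x <- "\r\n Ticker Symbol: RBO\r\n \t Exchange: TSX \r\n\t Assets ($mm) 36.26 \r\n\t Units Outstanding: 1,800,000 \r\n\t Mgmt. Fee** 0.25 \r\n 2013 MER* n/a \r\n\t CUSIP: 74932K103"

# hypothetical helper: first non-whitespace token that follows a literal label
value_after <- function(label, text) {
  start <- regexpr(label, text, fixed = TRUE)
  if (start == -1L) return(NA_character_)
  rest <- substring(text, start + attr(start, "match.length"))
  regmatches(rest, regexpr("\\S+", rest))
}

result <- list(
  Ticker = value_after("Ticker Symbol:", x),
  Assets = as.numeric(value_after("Assets ($mm)", x)),
  # drop the thousands separators before converting
  Shares = as.numeric(gsub(",", "", value_after("Units Outstanding:", x)))
)
str(result)
```

Matching the labels with fixed = TRUE avoids having to escape the parentheses and the dollar sign in "Assets ($mm)".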