For instance, in this example I would like to remove the elements of text that contain http or america.
> text <- c("One word#", "112a httpSentenceamerica", "you and meamerica", "three two one")
Hence, I would use the alternation operator, |.
> pattern <- "http|america"
This works because the whole expression is treated as a single pattern.
> grep(pattern, text, invert = TRUE, value = TRUE)
[1] "One word#" "three two one"
What if I have a long list of words that I would like to use in the pattern? How can I do it? I don't think I can keep chaining alternation operators by hand.
Thank you in advance!
Generally, as @akrun said:
text <- c("One word#", "112a httpSentenceamerica", "you and meamerica", "three two one")
pattern = c("http", "america")
grep(paste(pattern, collapse = "|"), text, invert = TRUE, value = TRUE)
# [1] "One word#" "three two one"
You wrote that your list of words is "long." This solution doesn't scale indefinitely, unsurprisingly:
long_pattern = paste(rep(pattern, 1300), collapse = "|")
nchar(long_pattern)
# [1] 16899
grep(long_pattern, text, invert = TRUE, value = TRUE)
# Error in grep(long_pattern, text, invert = TRUE, value = TRUE) :
But if necessary, you could map-reduce over the individual patterns instead of building one giant regex, starting with something along the lines of:
text[Reduce(`&`, Map(function(p) !grepl(p, text), rep(pattern, 1300)))]
# [1] "One word#" "three two one"
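An alternative that sidesteps the pattern-length limit entirely: treat each word as a fixed string and combine the logical vectors yourself. A sketch; `fixed = TRUE` also avoids surprises if any of the words contain regex metacharacters:

```r
text <- c("One word#", "112a httpSentenceamerica", "you and meamerica", "three two one")
pattern <- c("http", "america")

# one logical vector per word, OR-ed together, then negated to keep non-matches
hits <- Reduce(`|`, lapply(pattern, function(p) grepl(p, text, fixed = TRUE)))
text[!hits]
# [1] "One word#"     "three two one"
```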
I am trying to loop through regex results, and insert the first capture group into a variable to be processed in a loop. But I can't figure out how to do so. Here's what I have so far, but it just prints the second match:
aQuote = "The big boat has a big assortment of big things."
theMatches = regmatches(aQuote, gregexpr("big ([a-z]+)", aQuote, ignore.case = TRUE))
results = lapply(theMatches, function(m){
capturedItem = m[[2]]
print(capturedItem)
})
Right now it prints
[1] "big assortment"
What I want it to print is
[1] boat
[1] assortment
[1] things
Try this:
regmatches(aQuote, gregexpr("(?<=big )[a-z]+", aQuote, ignore.case = TRUE, perl = TRUE))[[1]]
#[1] "boat" "assortment" "things"
The equivalent regex in Perl / JavaScript is /big ([a-z]+)/ig; note the g (global) modifier, which gregexpr already takes care of in R.
A sample Perl program:
$aQuote = "The big boat has a big assortment of big things.";
print $1."\n" while ($aQuote =~ /big ([a-z]+)/ig);
Edit: In R, we can write:
aQuote = "The big boat has a big assortment of big things."
theMatches = regmatches(aQuote, gregexpr("big ([a-z]+)", aQuote, ignore.case = TRUE))
results = lapply(theMatches, function(m) {
  for (i in seq_along(m)) {
    print(m[[i]])
  }
})
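To get just the capture group (boat, assortment, things) rather than the full match, without switching to a lookbehind, one option is to strip the literal "big " prefix from each full match after extraction. A sketch, assuming the prefix is always the fixed word big:

```r
aQuote <- "The big boat has a big assortment of big things."
full <- regmatches(aQuote, gregexpr("big ([a-z]+)", aQuote, ignore.case = TRUE))[[1]]
# full is c("big boat", "big assortment", "big things"); drop the literal prefix
sub("(?i)big\\s+", "", full, perl = TRUE)
# [1] "boat"       "assortment" "things"
```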
Is there a better way to do a conditional string match? For example, the word farm is conditionally matched with rose, floral, and tree. Ideally I would like to do the matching without repeating farm.
str = c('rose','farm','rose farm','floral', 'farm floral', 'tree farm')
grep("((?=.*farm)(?=.*rose)|(?=.*farm)(?=.*floral)|(?=.*farm)(?=.*tree))", str, value = TRUE, perl = TRUE)
This returns:
[1] "rose farm" "farm floral" "tree farm"
One way is to use a grouping construct to combine the set of words:
grep('(?=.*farm)(?=.*(?:rose|floral|tree))', str, value = TRUE, perl = TRUE)
# [1] "rose farm" "farm floral" "tree farm"
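If the lookaheads feel opaque, the same result can be had with two plain grepl calls combined outside the regex. A sketch:

```r
str <- c('rose', 'farm', 'rose farm', 'floral', 'farm floral', 'tree farm')
# keep elements that contain "farm" AND at least one of the other words
res <- str[grepl("farm", str) & grepl("rose|floral|tree", str)]
res
# [1] "rose farm"   "farm floral" "tree farm"
```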
I have a data frame where one column is a character vector and every element in the vector is the full text of a document. I want to truncate words in each element so that maximum word length is 5 characters.
For example:
a <- c(1, 2)
b <- c("Words longer than five characters should be truncated",
"Words shorter than five characters should not be modified")
df <- data.frame("file" = a, "text" = b, stringsAsFactors=FALSE)
head(df)
file text
1 1 Words longer than five characters should be truncated
2 2 Words shorter than five characters should not be modified
And this is what I'm trying to get:
file text
1 1 Words longe than five chara shoul be trunc
2 2 Words short than five chara shoul not be modif
I've tried using strsplit() and strtrim() to modify each word (based in part on split vectors of words by every n words (vectors are in a list)):
x <- unlist(strsplit(df$text, "\\s+"))
y <- strtrim(x, 5)
y
[1] "Words" "longe" "than" "five" "chara" "shoul" "be" "trunc" "Words" "short" "than"
[12] "five" "chara" "shoul" "not" "be" "modif"
But I don't know if that's the right direction, because I ultimately need the words in a data frame associated with the correct row, as shown above.
Is there a way to do this using gsub and regex?
If you're looking to utilize gsub to perform this task:
> df$text <- gsub('(?=\\b\\pL{6,}).{5}\\K\\pL*', '', df$text, perl=T)
> df
# file text
# 1 1 Words longe than five chara shoul be trunc
# 2 2 Words short than five chara shoul not be modif
Here the lookahead (?=\b\pL{6,}) fires only at the start of a word of six or more letters, .{5}\K then consumes the first five characters and resets the match start, and the trailing letters matched by \pL* are what actually gets removed.
You were on the right track. In order for your idea to work, however, you have to do the split/trim/combine for each row separately. Here's a way to do it. I was very verbose on purpose, to make it clear, but you can obviously use fewer lines.
df$text <- sapply(df$text, function(str) {
  str <- unlist(strsplit(str, " "))
  str <- strtrim(str, 5)
  str <- paste(str, collapse = " ")
  str
})
And the output:
> df
file text
1 1 Words longe than five chara shoul be trunc
2 2 Words short than five chara shoul not be modif
The short version is
df$text <- sapply(df$text, function(str) {
  paste(strtrim(unlist(strsplit(str, " ")), 5), collapse = " ")
})
Edit:
I just realized you asked if it is possible to do this using gsub and regex. Even though you don't need those for this, it's still possible, but harder to read:
df$text <- sapply(df$text, function(str) {
  str <- unlist(strsplit(str, " "))
  str <- gsub("(?<=.{5}).+", "", str, perl = TRUE)
  str <- paste(str, collapse = " ")
  str
})
The regex matches anything that appears after 5 characters and replaces those with nothing. perl = TRUE is necessary to enable the regex lookbehind ((?<=.{5})).
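For completeness, the split step can be avoided altogether: a single gsub over the whole column works because the substitution is applied to every word independently. A sketch; note that \w also counts digits and underscores as word characters:

```r
b <- c("Words longer than five characters should be truncated",
       "Words shorter than five characters should not be modified")
# capture the first five word characters, drop the rest of each word
trimmed <- gsub("(\\w{5})\\w+", "\\1", b)
trimmed
# [1] "Words longe than five chara shoul be trunc"
# [2] "Words short than five chara shoul not be modif"
```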
I have a character vector like this:
text <- c("Car", "Ca-R", "My Car", "I drive cars", "Chars", "CanCan")
I would like to match a pattern so that it is matched only once and with at most one substitution/insertion. The result should look like this:
> "Car"
I tried the following to match my pattern only once with at most one substitution/insertion etc., and I get the following:
> agrep("ca?", text, ignore.case = T, max = list(substitutions = 1, insertions = 1, deletions = 1, all = 1), value = T)
[1] "Car" "Ca-R" "My Car" "I drive cars" "CanCan"
Is there a way to exclude the strings which are n-characters longer than my pattern?
An alternative which replaces agrep with adist:
text[which(adist("ca?", text, ignore.case=TRUE) <= 1)]
adist gives the number of insertions/deletions/substitutions required to convert one string to another, so keeping only elements with an adist of one or less should give you what you want, I think.
This answer is probably less appropriate if you really want to exclude things "n-characters longer" than the pattern (with n being variable), rather than just match whole words (where n is always 1 in your example).
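To see why this filters the way it does, it helps to inspect the raw distances. A sketch:

```r
text <- c("Car", "Ca-R", "My Car", "I drive cars", "Chars", "CanCan")
# edit distance of each element to "ca?" (case-insensitive)
d <- drop(adist("ca?", text, ignore.case = TRUE))
text[d <= 1]
# [1] "Car"
```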
You can use nchar to limit the strings based on their length:
pattern <- "ca?"
matches <- agrep(pattern, text, ignore.case = T, max = list(substitutions = 1, insertions = 1, deletions = 1, all = 1), value = T)
n <- 4
matches[nchar(matches) < n+nchar(pattern)]
# [1] "Car" "Ca-R" "My Car" "CanCan"
Say I have a line in a file:
string <- "thanks so much for your help all along. i'll let you know when...."
I want to return a value indicating if the word know is within 6 words of help.
This is essentially a very crude implementation of Crayon's answer as a basic function:
withinRange <- function(string, term1, term2, threshold = 6) {
  x <- strsplit(string, " ")[[1]]
  abs(grep(term1, x) - grep(term2, x)) <= threshold
}
withinRange(string, "help", "know")
# [1] TRUE
withinRange(string, "thanks", "know")
# [1] FALSE
I would suggest getting a basic idea of the text tools available to you, and using them to write such a function. Note Tyler's comment: As implemented, this can match multiple terms ("you" would match "you" and "your") leading to funny results. You'll need to determine how you want to deal with these cases to have a more useful function.
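One way to address that comment is to compare whole words exactly instead of grep-ing for substrings. A sketch (the function name is mine, not from the original answer): punctuation is stripped before comparing, so "along." matches "along", though a contraction like "i'll" becomes "ill":

```r
withinRangeExact <- function(string, term1, term2, threshold = 6) {
  words <- gsub("[[:punct:]]", "", strsplit(string, "\\s+")[[1]])
  i <- which(words == term1)
  j <- which(words == term2)
  # TRUE if any occurrence of term1 is within `threshold` words of any term2
  length(i) > 0 && length(j) > 0 && any(abs(outer(i, j, `-`)) <= threshold)
}

string <- "thanks so much for your help all along. i'll let you know when...."
withinRangeExact(string, "help", "know")  # TRUE
withinRangeExact(string, "you", "know")   # TRUE: matches "you" but not "your"
```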
You won't be able to get this from regex alone. I suggest splitting on spaces, then looping or using a built-in function to search the resulting array for your two terms and taking the difference of their indexes (array positions).
Edit: Okay, I thought about it for a second, and perhaps this will work for you as a regex pattern:
\bhelp(\s+[^\s]+){1,5}+\s+know\b
This takes the same "space is the delimiter" concept. It first matches help, then up to five repetitions of a space plus a word (possessively, so without backtracking), then looks for " know" (since "know" would be the 6th word).
Split your string:
> words <- strsplit(string, '\\s')[[1]]
Build an index vector:
> indices <- 1:length(words)
Name indices:
> names(indices) <- words
Compute distance between words:
> abs(indices["help"] - indices["know"]) < 6
FALSE
EDIT: In a function:
distance <- function(string, term1, term2) {
  words <- strsplit(string, "\\s")[[1]]
  indices <- seq_along(words)
  names(indices) <- words
  abs(indices[term1] - indices[term2])
}
distance(string, "help", "know") < 6
EDIT Plus
There is a great advantage in indexing words: once it's done, you can compute a lot of statistics on a text.
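One caveat with the named-vector approach above: indices[term] returns only the first occurrence of a repeated word. which() gives all positions, which makes distance statistics over repeated words straightforward. A sketch with a hypothetical sentence:

```r
# hypothetical sentence with a repeated word
string2 <- "the cat saw the other cat near the barn"
words2 <- strsplit(string2, "\\s")[[1]]
which(words2 == "cat")
# [1] 2 6
# minimum distance between any "cat" and "barn"
min(abs(outer(which(words2 == "cat"), which(words2 == "barn"), `-`)))
# [1] 3
```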