Exact string matching in R - regex

I am struggling with exact string matching in R. I need to match the searched string in a sentence only when it is an exact match:
sentence2 <- "laptop is a great product"
words2 <- c("top","laptop")
I was trying something like this:
sub(paste(c("^",words2,"$")),"",sentence2)
I need to replace laptop with an empty string only where the match is exact (laptop), but this didn't work...
Could you please help me? Thanks in advance.
Desired output:
is a great product

You can try:
gsub(paste0("^",words2," ",collapse="|"),"",sentence2)
#[1] "is a great product"
The result of paste0("^",words2," ",collapse="|") is "^top |^laptop " which means "either 'top' at the beginning of string followed by a space or 'laptop' at the beginning of string followed by a space".
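To make the difference concrete, here is a small sketch comparing the attempt from the question with the collapsed pattern. With a single vector argument and no collapse, paste() returns one string per element, and sub() then only uses the first element of a longer pattern vector (with a warning). The names attempt and pattern are only illustrative.

```r
words2 <- c("top", "laptop")

# The attempt from the question: one string per element, nothing is joined.
attempt <- paste(c("^", words2, "$"))
print(attempt)

# paste0() is vectorised over words2; collapse = "|" joins the pieces
# into a single alternation that gsub() can use as one pattern.
pattern <- paste0("^", words2, " ", collapse = "|")
print(pattern)
```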

If you want to match entire words, then you can use \\b to match word boundaries.
gsub(paste0('\\b', words2, '\\b', collapse='|'), '', sentence2)
## [1] " is a great product"
Add optional whitespace to the pattern if you want to replace the adjacent spaces as well.
gsub(paste0('\\s*\\b', words2, '\\b\\s*', collapse='|'), '', sentence2)
## [1] "is a great product"
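One thing to watch with the optional-whitespace version: for a word in the middle of a sentence it removes the spaces on both sides, which glues the neighbouring words together. A minimal sketch of an alternative, assuming the search terms are plain words without regex metacharacters; remove_words is a hypothetical helper name, not something from the question.

```r
# Hypothetical helper: drop whole-word matches, then tidy the whitespace.
remove_words <- function(sentence, words) {
  # \\b on both sides so that "top" does not match inside "laptop"
  pattern <- paste0("\\b(", paste(words, collapse = "|"), ")\\b")
  out <- gsub(pattern, "", sentence, perl = TRUE)
  # collapse runs of whitespace left behind, then trim both ends
  out <- gsub("\\s+", " ", out)
  trimws(out)
}

words2 <- c("top", "laptop")
remove_words("laptop is a great product", words2)
remove_words("my laptop is on top", words2)
remove_words("stop the laptop", "top")  # no whole-word "top", so unchanged
```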


Replacing NAs with 0 or space depending on the position of the NA in a string

I have a data frame with a million long strings that contain 0s, 1s, and NAs.
I have to replace the NAs according to the following rules:
every NA at the end of the string has to be replaced with a space
every NA in the middle of the string has to be changed to 0.
Example:
Let's assume I have the following string
0011NANA01NA0011NANANANA
My desired output:
'0011000100011____',
which means at the end of the string all of the NA should be replaced by space (I used '_' to indicate spaces).
AFAIK I should use gsub() to make these changes. I tried the following code without success.
gsub("NA", " ", "0011NANA01NA0011NANANANA") - which replaces all of the NAs with a space.
gsub("NA$", " ", "0011NANA01NA0011NANANANA") - which replaces only the last NA of the string with a space.
This works fine if I have only one NA at the end of the string. But how can I change all 4 NAs at the end of the string in this example?
Could someone help me out with this problem?
Thanks in advance for any kind of help!
This'll do it. But like Richard said, you may want to focus your efforts on earlier in the code, if it's in your power.
s <- "0011NANA01NA0011NANANANA"
# inner regex: find each NA that is followed by
# _only_ N or A characters until the string ends;
# those become spaces.
# outer regex: replace the remaining NAs with 0
gsub("NA", "0", gsub("NA(?=[NA]*$)", " ", s, perl = TRUE))
# [1] "0011000100011    "
Explore the more complicated regex here
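For a whole column of strings, the same idea can be wrapped in a small function. This is only a sketch: fill_na is a hypothetical name, and the lookahead uses (?:NA)* instead of the character class [NA]*, which is a slightly stricter variant (whole "NA" tokens only). For strings made of 0, 1 and NA the two behave the same.

```r
# Hypothetical helper, vectorised over a character vector.
fill_na <- function(x) {
  # trailing NAs: each NA followed only by further NAs up to the end
  x <- gsub("NA(?=(?:NA)*$)", " ", x, perl = TRUE)
  # every NA that is left sits in the middle (or at the start) of the string
  gsub("NA", "0", x, fixed = TRUE)
}

fill_na("0011NANA01NA0011NANANANA")
fill_na(c("01NA", "NA01"))
```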
Here is another nested gsub. The inner call replaces every "NA" with a space. In the outer call, the first alternative matches one or more spaces (\\s+) at the end of the string ($); (*SKIP)(*FAIL) forces that match to be discarded and skipped over, so only the second alternative (\\s), i.e. any space that is not at the end of the string, is matched and replaced with 0.
gsub("\\s+$(*SKIP)(*F)|\\s", "0", gsub("NA", " ", s), perl=TRUE)
#[1] "0011000100011    "
data
s <- "0011NANA01NA0011NANANANA"
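The (*SKIP)(*F) idiom may be easier to follow on a smaller input. A minimal sketch, using "_" as the replacement so the result is visible; x and res are just illustrative names.

```r
# "a", one space, "b", two spaces, "c", three trailing spaces
x <- "a b  c   "

# First alternative: trailing whitespace is matched, then discarded and
# skipped by (*SKIP)(*F). Second alternative: any other whitespace
# character is matched and replaced.
res <- gsub("\\s+$(*SKIP)(*F)|\\s", "_", x, perl = TRUE)
print(res)
# [1] "a_b__c   "
```

The inner spaces are replaced one by one, while the trailing run is left untouched.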

How to extract substrings that match a particular regular expression from a string in R

I am trying to write a function that returns all the substrings of a string that match a regular expression. For example:
str <- "hello Brother How are you"
I want to extract all the substrings of str that match this regular expression - "[A-z]+ [A-z]+"
which results in -
"hello Brother"
"Brother How"
"How are"
"are you"
Is there any library function that can do that?
You can do it with the str_match_all function from the stringr library and the method Tim Pietzcker described in his answer (capturing inside an unanchored positive lookahead):
> library(stringr)
> str <- "hello Brother How are you"
> res <- str_match_all(str, "(?=\\b([[:alpha:]]+ [[:alpha:]]+))")
> l <- unlist(res)
> l[l != ""]
## [1] "hello Brother" "Brother How" "How are" "are you"
Or to only get unique values:
> unique(l[l != ""])
##[1] "hello Brother" "Brother How" "How are" "are you"
I just advise using [[:alpha:]] instead of [A-z], since [A-z] matches more than just letters (the range also covers [, \, ], ^, _ and `).
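A quick check of that point, as a sketch. perl = TRUE is used here on the assumption that PCRE interprets the range by character code, which makes the result independent of the locale; specials is just an illustrative name.

```r
# The six ASCII characters that sit between "Z" and "a"
specials <- c("[", "\\", "]", "^", "_", "`")

in_A_to_z <- grepl("^[A-z]$", specials, perl = TRUE)        # the range from the question
in_letters <- grepl("^[A-Za-z]$", specials, perl = TRUE)    # explicit letter ranges
in_alpha <- grepl("^[[:alpha:]]$", specials, perl = TRUE)   # POSIX class

print(data.frame(specials, in_A_to_z, in_letters, in_alpha))
```

Only the [A-z] column should come out TRUE for these characters.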
Regex matches "consume" the text they match, therefore (generally) the same bit of text can't match twice. But there are constructs called lookaround assertions which don't consume the text they match, and which may contain capturing groups.
That makes your endeavor possible (although you can't use [A-z], that doesn't do what you think it does):
(?=\b([A-Za-z]+ [A-Za-z]+))
will match as expected; you need to look at group 1 of the match result, not the matched text itself (which will always be empty).
The \b word boundary anchor is necessary to ensure that our matches always start at the beginning of a word (otherwise you'd also have the results "ello Brother", "llo Brother", "lo Brother", and "o Brother").
Test it live on regex101.com.
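The same lookahead trick also seems to work in base R without stringr. A sketch, assuming perl = TRUE so that gregexpr() reports the capture-group positions in its "capture.start" and "capture.length" attributes; the variable names are illustrative.

```r
str <- "hello Brother How are you"

# Zero-width lookahead with a capture group: the match itself is empty,
# the interesting text is in group 1.
m <- gregexpr("(?=\\b([A-Za-z]+ [A-Za-z]+))", str, perl = TRUE)[[1]]

# Positions and lengths of group 1 for every (empty) match
starts <- attr(m, "capture.start")[, 1]
lens <- attr(m, "capture.length")[, 1]

pairs <- substring(str, starts, starts + lens - 1)
print(pairs)
```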

Matching a word after another word in R regex

I have a dataframe in R with one column (called 'city') containing a text string. My goal is to extract only one word, i.e. the city, from the text string. The city always follows the word 'in', e.g. the text might be:
'in London'
'in Manchester'
I tried to create a new column ('municipality'):
df$municipality <- gsub(".*in ?([A-Z]+).*$","\\1",df$city)
This gives me the first letter following 'in', but I need the next word (ONLY the next word)
I then tried:
gsub(".*in ?([A-Z]\w+))")
which worked on a regex checker, but not in R. Can someone please help me. I know this is probably very simple but I can't crack it. Thanks in advance.
We can use str_extract
library(stringr)
str_extract(df$city, '(?<=in\\s)\\w+')
#[1] "London" "Manchester"
The following regular expression will match the second word from your city column:
^in\\s([^ ]*).*$
This matches the word in followed by a single whitespace character, then a capture group of any non-space characters, which holds the city name.
Example:
df <- data.frame(city=c("in London town", "in Manchester city"))
df$municipality <- gsub("^in\\s([^ ]*).*$", "\\1", df$city)
> df$municipality
[1] "London" "Manchester"
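If stringr is not available, the lookbehind approach can be sketched in base R with regexpr() and regmatches(). The \\b in front of "in" is an addition here, so that the "in" at the end of a word such as "Berlin" is not taken for the keyword; the sample vector is made up for illustration.

```r
city <- c("in London town", "offices in Manchester city", "Berlin office")

# Word directly after a standalone "in" (PCRE lookbehind needs perl = TRUE)
m <- regexpr("(?<=\\bin\\s)\\w+", city, perl = TRUE)

# regmatches() only returns the elements that matched, so fill an
# NA vector at the matching positions.
municipality <- rep(NA_character_, length(city))
municipality[m > 0] <- regmatches(city, m)
print(municipality)
```

Elements without a standalone "in" stay NA, similar to what str_extract returns.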

Extracting clock time from string

I have a dataframe that consists of web-scraped data. One of the fields scraped was a time in clock time, but the scraping process wasn't perfect. Most of the 'good' data look something like '4:33, or '103:20 (so a leading single quote, and two fields, minutes and seconds). Also, there is some bad data, the most common one being '],, but also some containing text. I'd like a new string that is something like 4:33, and for bad data, just blank.
So my plan of attack is to match my good data form, and then replace everything else with a blank. Something like time <- gsub('[0-9]+:[0-9]+', '', time). I know this would replace my pattern with a blank, and I want the opposite, but I'm unsure how to negate the whole pattern. A simple caret doesn't seem to work, nor does applying it to a group. I tried something like gsub("(.)+([0-9]+)(:)([0-9]+)", "\\2\\3\\4", time) but that isn't working either.
Sample:
dput(sample)
c("'], ", "' Ling (2-0)vsThe Dragon(2-0)", "'8:18", "'13:33",
"'43:33")
Expected output:
c("", "", "8:18", "13:33", "43:33")
We can use grepl to set the elements that do not follow the pattern to '' and then remove the quote (') with sub. Here, the pattern describes strings that start (^) with ', followed by digits, :, and digits, in that order, to the end ($) of the string. All other elements (selected by negating the logical index from grepl with !) are assigned '', and sub then removes the '.
sample[!grepl("^'\\d+:\\d+$", sample)] <- ''
sub("'", '', sample)
#[1] "" "" "8:18" "13:33" "43:33"
Or we can also do this in one step using gsub by replacing all those characters (.) that do not follow the pattern \\d+:\\d+ with ''.
gsub("(\\d+:\\d+)(*SKIP)(*F)|.", '', sample, perl=TRUE)
#[1] "" "" "8:18" "13:33" "43:33"
Or another option is str_extract from library(stringr). It is not clear whether there are other patterns such as "some text '08:20 value" in the OP's original dataset or not. The str_extract will also extract those time values, if present.
library(stringr)
str_extract(sample, '\\d+:\\d+')
#[1] NA NA "8:18" "13:33" "43:33"
It will give NA instead of '' for the elements that don't follow the pattern.
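If '' is wanted rather than NA, the same extraction can be sketched in base R. scraped and out are illustrative names; the data is the sample from the question.

```r
scraped <- c("'], ", "' Ling (2-0)vsThe Dragon(2-0)", "'8:18", "'13:33", "'43:33")

# Position of the first minutes:seconds pattern in each element (-1 if none)
m <- regexpr("[0-9]+:[0-9]+", scraped)

# Start from all-empty strings and fill in only the elements that matched
out <- character(length(scraped))
out[m > 0] <- regmatches(scraped, m)
print(out)
```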
You can use sub:
sub('.+?(?=[0-9]+:[0-9]+)|.+', '', sample, perl = TRUE)
[1] "" "" "8:18" "13:33" "43:33"
The regex consists of two parts that are combined with a logical or (|).
.+?(?=[0-9]+:[0-9]+)
This part matches one or more characters, as few as possible, that are followed by the target pattern.
.+
This part matches one or more characters.
The logic: Replace everything preceding the target pattern with an empty string (''). If there is no target pattern, replace everything with the empty string.
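To see what each part contributes, here is a small sketch that runs the first alternative on its own and then the full pattern. scraped, first_only and both are illustrative names; the data is the sample from the question.

```r
scraped <- c("'], ", "' Ling (2-0)vsThe Dragon(2-0)", "'8:18", "'13:33", "'43:33")

# First alternative alone: strips what precedes a time,
# but leaves the elements without a time untouched.
first_only <- sub(".+?(?=[0-9]+:[0-9]+)", "", scraped, perl = TRUE)
print(first_only)

# With the ".+" fallback, elements without a time are emptied as well.
both <- sub(".+?(?=[0-9]+:[0-9]+)|.+", "", scraped, perl = TRUE)
print(both)
```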

Regex Valid Twitter Mention

I'm trying to find a regex that tells me whether a Tweet is a true mention. To be a mention, the string can't start with "#", can't contain "RT" (case insensitive), and the "#" must start a word.
In the examples I have commented the desired output.
Some examples:
function search($strings, $regexp) {
    // Report for each string whether it matches the pattern.
    foreach ($strings as $string) {
        echo "Sentence: \"$string\" <- " .
            (preg_match($regexp, $string) ? "MATCH" : "NO MATCH") . "\n";
    }
}
$strings = array(
"Hi #peter, I like your car ", // <- MATCH
"#peter I don't think so!", //<- NO MATCH: the string it's starting with # it's a reply
"Helo!! :# how are you!", // NO MATCH <- it's not a word, we need #(word)
"Yes #peter i'll eat them this evening! RT #peter: hey #you, do you want your pancakes?", // <- NO MATCH "RT/rt" on the string , it's a RT
"Helo!! ineed#aser.com how are you!", //<- NO MATCH, it doesn't start with #
"#peter is the best friend you could imagine. RT #juliet: #you do you know if #peter it's awesome?" // <- NO MATCH starting with # it's a reply and RT
);
echo "Example 1:\n";
search($strings, "/(?:[[:space:]]|^)#/i");
Current output:
Example 1:
Sentence: "Hi #peter, I like your car " <- MATCH
Sentence: "#peter I don't think so!" <- MATCH
Sentence: "Helo!! :# how are you!" <- NO MATCH
Sentence: "Yes #peter i'll eat them this evening! RT #peter: hey #you, do you want your pancakes?" <- MATCH
Sentence: "Helo!! ineed#aser.com how are you!" <- MATCH
Sentence: "#peter is the best friend you could imagine. RT #juliet: #you do you know if #peter it's awesome?" <- MATCH
EDIT:
I need it as a regex because it may be used in MySQL and other
languages too. I am not looking for any username. I only want to know
whether the string is a mention or not.
This regexp might work a bit better: /\B\#([\w\-]+)/gim
Here's a jsFiddle example of it in action: http://jsfiddle.net/2TQsx/96/
Here's a regex that should work:
/^(?!.*\bRT\b)(?:.+\s)?#\w+/i
Explanation:
/^ //start of the string
(?!.*\bRT\b) //Verify that rt is not in the string.
(?:.+\s)? //Optionally match any chars followed by a whitespace char.
//Note: (?: ) makes the group non-capturing.
#\w+ //Find # followed by one or more word chars.
/i //Make it case insensitive.
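To check a pattern like this against the six example strings from the question, here is a sketch in R (using perl = TRUE for the lookaheads). Two things differ from the regex above and are assumptions of this sketch: a leading (?!#) is added to reject strings that start with "#", and the whitespace before the "#" is made mandatory.

```r
strings <- c(
  "Hi #peter, I like your car ",
  "#peter I don't think so!",
  "Helo!! :# how are you!",
  "Yes #peter i'll eat them this evening! RT #peter: hey #you, do you want your pancakes?",
  "Helo!! ineed#aser.com how are you!",
  "#peter is the best friend you could imagine. RT #juliet: #you do you know if #peter it's awesome?"
)

# (?!#)           the string must not start with "#"
# (?!.*\\bRT\\b)  no standalone "RT" anywhere (any case, via ignore.case)
# .*\\s#\\w+      a "#" preceded by whitespace and followed by word chars
pattern <- "^(?!#)(?!.*\\bRT\\b).*\\s#\\w+"

is_mention <- grepl(pattern, strings, perl = TRUE, ignore.case = TRUE)
print(is_mention)
```

With these strings only the first one should come out as a mention, which is the output the question asks for.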
I have found that this is the best way to find mentions inside a string in JavaScript. I don't know exactly how I would handle the RTs, but I think this might help with part of the problem.
var str = "#jpotts18 what is up man? Are you hanging out with #kyle_clegg";
var pattern = /#[A-Za-z0-9_-]*/g;
str.match(pattern);
["#jpotts18", "#kyle_clegg"]
I guess something like this will do it:
^(?!.*?RT\s).+\s#\w+
Roughly translated to:
At the beginning of the string, look ahead to check that RT\s is not present, then match one or more characters followed by whitespace, a #, and at least one letter, digit or underscore.
Twitter has published the regex they use in their twitter-text library. They have other language versions posted as well on GitHub.
A simple one that works correctly even if the scraping tool has sometimes appended special characters: (?<![\w])#[\S]*\b. This worked for me.