R: detect words and punctuation marks in text - regex

I have some naturally occurring text:
text="word1 word2 word3. word4, word5 word6 word7"
And some elements that I want to detect in that text:
elements=c("word2","word6 word7",".",",")
However,
elements[sapply(paste0("\\<",elements,"\\>"),grepl,text)]
only returns the unigram "word2" and the bigram "word6 word7". The period and comma, which are in the text, are not detected.
How can I achieve that?

You don't need the square brackets: in regex, square brackets are metacharacters that define a character class. Pass fixed=T instead, so that each element is matched as a literal string:
> text="word1 word2 word3. word4, word5 word6 word7"
> elements=c("word2","word6 word7",".",",")
> elements[sapply(paste0(elements),grepl,text, fixed=T)]
[1] "word2" "word6 word7" "." ","

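elements[sapply(paste0("[",elements,"]"),grepl,text)] returns all four elements, but only because each pattern becomes a character class that matches any single one of its characters (for example, "[word2]" matches any of w, o, r, d or 2), so it can report elements that are not actually in the text.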
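To make the difference between the three approaches concrete, here is a minimal sketch of the same checks. It uses Python's re module rather than R, purely for illustration, and the variable names are assumptions of mine, not part of the question. Escaping each element with re.escape turns it into a literal pattern, which is roughly what fixed=T does for grepl.

```python
import re

text = "word1 word2 word3. word4, word5 word6 word7"
elements = ["word2", "word6 word7", ".", ","]

# Literal matching: re.escape() neutralises metacharacters such as "."
literal = [e for e in elements if re.search(re.escape(e), text)]

# Word-boundary matching: \b needs a word character on exactly one side,
# so "." followed by a space has no boundary after it and is missed.
bounded = [e for e in elements if re.search(r"\b" + re.escape(e) + r"\b", text)]

# Square brackets build a character class: "[word2]" matches any ONE of
# the characters w, o, r, d, 2, so it also "finds" word2 in unrelated text.
false_positive = re.search("[word2]", "dog") is not None

print(literal)
print(bounded)
print(false_positive)
```

Only the literal version finds all four elements for the right reason; the bounded version drops the period and comma, and the character class matches text that does not contain word2 at all.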

Related

Regex negative lookaround with optional whitespace

I am trying to find the digits, not succeeded by certain words. I do this using regular expressions in Python3. My guess is that negative lookarounds have to be used, but I'm struggling due to optional whitespaces. See the following example:
'200 word1 some 50 foo and 5foo 30word2'
Note that in reality word1 and word2 can be replaced by a lot of different words, making it much harder to search for a positive match on these words. Therefore it would be easier to exclude the numbers succeeded by foo. The expected result is:
[200, 30]
My try:
s = '200 foo some 50 bar and 5bar 30foo'
pattern = r"[0-9]+\s?(?!foo)"
re.findall(pattern, s)
Results in
['200', '50 ', '5', '3']
You may use
import re
s = '200 word1 some 50 foo and 5foo 30word2'
pattern = r"\b[0-9]+(?!\s*foo|[0-9])"
print(re.findall(pattern, s))
# => ['200', '30']
Details
\b - a word boundary
[0-9]+ - 1+ ASCII digits only
(?!\s*foo|[0-9]) - not immediately followed with
    \s*foo - 0+ whitespaces and foo string
    | - or
    [0-9] - an ASCII digit.
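You should be using the pattern \b[0-9]+(?!\s*foo\b)(?=\D), which finds every number that is not followed by optional whitespace and the word foo. The trailing (?=\D) requires a non-digit after the match, which stops the engine from backtracking to a shorter number such as the 5 in 50 (it also means a number at the very end of the string is not matched).
import re
s = '200 word1 some 50 foo and 5foo 30word2'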
matches = re.findall(r'\b[0-9]+(?!\s*foo\b)(?=\D)', s)
print(matches)
This prints:
['200', '30']
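Since the question mentions that the excluded word may vary, the first answer's pattern can be built from a list of words. The following is only a sketch under that assumption; the function name numbers_not_followed_by and its parameters are hypothetical. It also reruns the original attempt to show where the partial matches come from.

```python
import re

def numbers_not_followed_by(text, words):
    """Return digit runs in `text` not followed by any of `words`
    (optional whitespace is allowed in between)."""
    alternatives = "|".join(re.escape(w) for w in words)
    # [0-9] inside the lookahead stops backtracking into a shorter number
    pattern = rf"\b[0-9]+(?!\s*(?:{alternatives})|[0-9])"
    return re.findall(pattern, text)

s = "200 word1 some 50 foo and 5foo 30word2"
print(numbers_not_followed_by(s, ["foo"]))  # ['200', '30']

# The original attempt lets [0-9]+ and \s? give characters back until
# the lookahead passes, which is why '30foo' still yields '3'.
attempt = re.findall(r"[0-9]+\s?(?!foo)", "200 foo some 50 bar and 5bar 30foo")
print(attempt)  # ['200', '50 ', '5', '3']
```

The sketch assumes `words` is non-empty; an empty list would produce an empty alternative that excludes every number.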

notepad++ how to insert replacement after the following word

text:   [aa-b c d...]
result: [b-123 c d...]
text:   [aa-word1 word2 word3 ...]
result: [word1-123 word2 word...]
text:   [aa-bananas oranges apples]
result: [bananas-123 oranges apples]
I want to remove aa-, and -123 should be inserted only after the next word.
The next word should be a parameter rather than fixed text like the prefix aa-, because there are many different cases to replace.
I'll change "aa-" to many other variants: "bb-", "cc-"...
But word1 is always variable in the text.
Ctrl+H
Find what: \[aa-(\w+)
Replace with: [$1-123
check Match case
check Wrap around
check Regular expression
Replace all
Explanation:
\[ # opening square bracket
aa # literally 2 a
- # hyphen
(\w+) # group 1, 1 or more word character
Result for given example:
[b-123 c d...]
[word1-123 word2 word3 ...]
[bananas-123 oranges apples]
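Because the prefix is meant to vary ("aa-", "bb-", "cc-"...), the same capture-and-reinsert idea can be sketched outside Notepad++ with the prefix and suffix as parameters. This is a Python illustration only; the function name retag and its arguments are hypothetical, not something the answer defines.

```python
import re

def retag(text, prefix, suffix):
    """Drop `prefix` after "[" and append `suffix` to the word that follows."""
    pattern = r"\[" + re.escape(prefix) + r"(\w+)"
    # A function replacement avoids any escaping issues in `suffix`
    return re.sub(pattern, lambda m: "[" + m.group(1) + suffix, text)

lines = [
    "[aa-b c d...]",
    "[bb-word1 word2 word3 ...]",
    "[aa-bananas oranges apples]",
]
converted = [retag(line, "aa-", "-123") for line in lines]
print(converted)
```

Only the lines that start with the given prefix change; the "[bb-..." line is left alone until retag is called with "bb-".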

Regular expression for matching the same words in the text

For example, I have text:
The word1 word2 is word3
(note word1 can be == word2 == word3)
I want my regular expression to match only when the distance between the words word(i) is <= N. The distance is the number of words between two occurrences.
The distance between word1 and word2 is 0.
The distance between word2 and word3 is 1.
The distance between word1 and word3 (=2) should not be taken into account.
I wrote a regular expression to solve this problem, but it also takes into account the distance between the first and the last identical words. How can I fix it?
(\b\w+\b)\W+((\b\w+\b)\W+){N,}?\1
For my example text I want a regular expression that finds matches only when N=0 or 1.
(\b\w+\b)\W+((\b\w+\b)\W+){0,}?\1
(\b\w+\b)\W+((\b\w+\b)\W+){1,}?\1
But now it also works when N=2
(\b\w+\b)\W+((\b\w+\b)\W+){2,}?\1
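One thing worth noting about the patterns above is that {N,} means "N or more" repetitions, so it never limits the distance from above. A possible reading of the requirement is "the same word recurs with at most N words in between", which calls for {0,N} instead. The sketch below illustrates that reading in Python; the function name repeated_within is hypothetical and the interpretation is an assumption, not something confirmed in the thread.

```python
import re

def repeated_within(text, max_gap):
    """True if some word recurs with at most `max_gap` words in between."""
    # {0,max_gap} is an upper bound; {max_gap,} would mean "max_gap or more"
    pattern = rf"\b(\w+)\W+(?:\w+\W+){{0,{max_gap}}}\1\b"
    return re.search(pattern, text) is not None

print(repeated_within("a b c a", 1))  # False: two words separate the a's
print(repeated_within("a b c a", 2))  # True
print(repeated_within("a b a", 1))    # True
```

The trailing \b keeps the backreference from matching only the start of a longer word.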

how to replace a single/double character in a string

I want to replace every single-character word in my string with a blank. My idea is that there should be a space before and after the single character, so I put spaces before and after the character in the pattern, but that doesn't seem to work. I also want to replace words with more than one character: if I want to replace all words of length 2, say, how would the code change?
str="I have a cat of white color"
str=gsub("([[:space:]][[a-z]][[:space:]])", "", str)
I want to replace all the single character in my string with a blank. My idea is that there should be a space before and after the single character.
The idea is not correct: a word is not always surrounded by spaces. What if the word is at the beginning of the string? Or at the end? Or is followed by punctuation?
Use \b word boundary:
There are three different positions that qualify as word boundaries:
- Before the first character in the string, if the first character is a word character.
- After the last character in the string, if the last character is a word character.
- Between two characters in the string, where one is a word character and the other is not a word character.
NOTE that in R, when you use gsub, it is best to use it with the PCRE regex (pass perl=T):
POSIX 1003.2 mode of gsub and gregexpr does not work correctly with repeated word-boundaries (e.g., pattern = "\b"). Use perl = TRUE for such matches (but that may not work as expected with non-ASCII inputs, as the meaning of ‘word’ is system-dependent).
So, to match all 1-letter words, you need to use
gsub("(?i)\\b[a-z]\\b", "REPLACEMENT", input, perl=T) ## To replace 1 ASCII letter words
Note that (?i) is a case-insensitive modifier (making [a-z] match both a and A).
Now, you need to match 2 letter words:
gsub("(?i)\\b[a-z]{2}\\b", "REPLACEMENT", input, perl=T) ## To replace 2 ASCII letter words
Here, we are using a limiting quantifier ({n} for an exact count, or {min,max} for a range) to specify how many times the quantified pattern must repeat.
See IDEONE demo:
> input = "I am a football fan"
> gsub("(?i)\\b[a-z]\\b", "REPLACEMENT", input, perl=T) ## To replace 1 ASCII letter words
[1] "REPLACEMENT am REPLACEMENT football fan"
> gsub("(?i)\\b[a-z]{2}\\b", "REPLACEMENT", input, perl=T) ## To replace 2 ASCII letter words
[1] "I REPLACEMENT a football fan"
You need to use a quantifier, e.g. [a-z]{2}, which matches exactly two consecutive letters from a to z. The regex pattern you want is something along the lines of this:
\\s[a-z]{2}\\s
You can build this regex dynamically in R using an input number of characters. Here is a code snippet which demonstrates this:
> str <- "I have a cat of white color"
> nchars <- 2
> exp <- paste0("\\s[a-z]{", nchars, "}\\s")
> gsub(exp, "", str)
[1] "I have a catwhite color"
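The two answers can be combined: build the pattern from a length parameter, but use \b instead of \s so that the surrounding spaces are not consumed and words at the start or end of the string are still found. Here is a minimal sketch in Python rather than R, for illustration only; the function name replace_words_of_length is hypothetical.

```python
import re

def replace_words_of_length(text, n, replacement):
    """Replace every ASCII-letter word of exactly `n` letters."""
    # {{{n}}} renders as {n}, e.g. \b[a-z]{2}\b for n = 2
    pattern = rf"\b[a-z]{{{n}}}\b"
    return re.sub(pattern, replacement, text, flags=re.IGNORECASE)

s = "I have a cat of white color"
print(replace_words_of_length(s, 1, "X"))  # X have X cat of white color
print(replace_words_of_length(s, 2, "X"))  # I have a cat X white color
```

Unlike the \s version, this leaves the spaces around the replaced word intact, so neighbouring words are not glued together.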

Replace words starting with particular character in R

I want to replace all words that begin with a given character with a different word. I tried gsub and str_replace_all, but with little success. In this example I want to replace all words starting with R with MM. gsub replaces only the first one:
gsub("^R*\\w+", "MM", "Red, Rome, Ralf")
# [1] "MM, Rome, Ralf"
Thanks in advance
You must either remove the string start anchor (^) or work with a vector of words:
gsub("\\bR\\w+", "MM", "Red, Rome, Ralf")
#[1] "MM, MM, MM"
gsub("^R\\w+", "MM", c("Red", "Rome", "Ralf"))
#[1] "MM" "MM" "MM"
Also, you probably want "R" instead of "R*", since the latter can match 0 or more instances of "R". The regexes above match only words with 2 or more characters, the first of which must be "R". The last regex only matches words at the beginning of the string.
Thanks @flodel for pointing out the missing word boundary "\b" in the first regex!
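The effect of the anchor versus the word boundary can be checked side by side. The sketch below uses Python's re.sub instead of gsub, purely as an illustration of the same patterns; the variable names are mine.

```python
import re

s = "Red, Rome, Ralf"

# ^ anchors at the start of the string, so only the first word can match
anchored = re.sub(r"^R*\w+", "MM", s)

# \b matches at the start of every word, so each R-word is replaced
bounded = re.sub(r"\bR\w+", "MM", s)

# R\w+ needs at least one character after the R, so a lone "R" is untouched
lone = re.sub(r"\bR\w+", "MM", "R, Rome")

print(anchored)  # MM, Rome, Ralf
print(bounded)   # MM, MM, MM
print(lone)      # R, MM
```

If one-letter words should be replaced too, \w* in place of \w+ would allow zero characters after the R.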