gsub with exception in R - regex

I'm removing English characters from Hebrew text but would like to keep a short list of English words that I want, e.g. words2keep <- c("ok", "hello", "yes*").
My current regex is text <- gsub("[A-Z,a-z]", "", text), but how do I add an exception so that it does not remove all English words?
Reproducible example:
text = "ok אני מסכים איתך Yossi Cohen"
after gsub with exception
text = "ok אני מסכים איתך"
Thank you for all suggestions

This is a tricky one. I think we can do it by matching whole words with the \b word boundary assertion, and placing a negative lookahead assertion just before the match that rejects the words (again, whole words) that you want to preserve. This appears to be working:
gsub(perl=T,paste0('(?!\\b',paste(collapse='\\b|\\b',words2keep),'\\b)\\b[A-Za-z]+\\b'),'',text);
[1] "ok אני מסכים איתך "
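For comparison, here is a minimal Python sketch of the same negative-lookahead idea. It is not the answer's R code: the whitelist contents, the `strip_english` helper name and the final whitespace clean-up are assumptions made for illustration, and the whitelist entries are treated as literal words rather than regex fragments.

```python
import re

# Hypothetical whitelist of English words to preserve (literal words only).
words_to_keep = ["ok", "hello", "yes"]

# Escape each word so it is matched literally, then join into an alternation.
keep = "|".join(re.escape(w) for w in words_to_keep)

# \b             start at a word boundary
# (?!(...)\b)    refuse to match if a whole whitelisted word starts here
# [A-Za-z]+\b    otherwise match one whole run of Latin letters
pattern = re.compile(r"\b(?!(?:" + keep + r")\b)[A-Za-z]+\b")

def strip_english(text):
    """Remove Latin-letter words that are not whitelisted, then tidy spaces."""
    cleaned = pattern.sub("", text)
    return " ".join(cleaned.split())

result = strip_english("ok אני מסכים איתך Yossi Cohen")
print(result)
```

Collapsing the whitespace afterwards is a design choice of this sketch; the substitution on its own leaves the spaces that surrounded the removed words.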

Use gsub with the pattern [A-Z].*: [A-Z] matches the first uppercase Latin letter and .* matches everything after it, so the text from the first capitalized English word to the end of the string is removed.
gsub("[A-Z].*","",text)
[1] "ok אני מסכים איתך "
#data
text = "ok אני מסכים איתך Yossi Cohen"

Related

Regex for text (and numbers and special characters) between multiple commas [duplicate]

I'm going nuts trying to get a regex to detect spam of keywords in the user inputs. Usually there is some normal text at the start and the keyword spam at the end, separated by commas or other chars.
What I need is a regex to count the number of keywords to flag the text for a human to check it.
The text is usually like this:
[random text, with commas, dots and all]
keyword1, keyword2, keyword3, keyword4, keyword5,
Keyword6, keyword7, keyword8...
I've tried several regex to count the matches:
-This only gets one out of two keywords
[,-](\w|\s)+[,-]
-This also matches the random text
(?:([^,-]*)(?:[^,-]|$))
Can anyone tell me a regex to do this? Or should I take a different approach?
Thanks!
Per your answer to my question, here is a regexp to match a string that occurs between two commas.
(?<=,)[^,]+(?=,)
This regexp does not match, and hence does not consume, the delimiting commas.
This regexp would match " and hence does not consume" in the previous sentence.
The fact that your regexp matched and consumed the commas was the reason why your attempted regexp only matched every other candidate.
Also, if the whole input is a single string, you will want to prevent matches from spanning line breaks. In that case you will want to use:
(?<=,)[^,\n]+(?=,)
http://www.phpliveregex.com/p/1DJ
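To illustrate why consuming the commas skips every other candidate, here is a small Python sketch. The sample string and variable names are made up for illustration; the two patterns are the comma-consuming style and the lookaround version from this answer.

```python
import re

# Hypothetical input: four keywords between an intro and an outro field.
text = "intro,kw1,kw2,kw3,kw4,outro"

# Consuming version: each match swallows its trailing comma, so the next
# keyword has no leading comma left to match against.
consuming = re.findall(r",[^,\n]+,", text)

# Lookaround version: the commas are only asserted, never consumed.
lookaround = re.findall(r"(?<=,)[^,\n]+(?=,)", text)

print(consuming)
print(lookaround)
```

The consuming pattern finds only `kw1` and `kw3` (with their commas), while the lookaround pattern finds all four keywords.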
As others have said this is potentially a very tricky thing to do... It suffers from all of the same failures as general "word filtering" (e.g. people will "mask" the input). It is made even more difficult without plenty of example posts to test against...
Solution
Anyway, assuming that the keywords are on separate lines from the rest of the input and are separated by commas, you can match the lines containing keywords like this:
Regex
#(?:^)((?:(?:[\w\.]+)(?:, ?|$))+)#m
Input
Taken from your question above:
[random text, with commas, dots and all]
keyword1, keyword2, keyword3, keyword4, keyword5,
Keyword6, keyword7, keyword8...
Output
// preg_match_all('#(?:^)((?:(?:[\w\.]+)(?:, ?|$))+)#m', $string, $matches);
// var_dump($matches);
array(2) {
[0]=>
array(2) {
[0]=>
string(49) "keyword1, keyword2, keyword3, keyword4, keyword5,"
[1]=>
string(31) "Keyword6, keyword7, keyword8..."
}
[1]=>
array(2) {
[0]=>
string(49) "keyword1, keyword2, keyword3, keyword4, keyword5,"
[1]=>
string(31) "Keyword6, keyword7, keyword8..."
}
}
Explanation
#(?:^)((?:(?:[\w]+)(?:, ?|$))+)#m
1. # => Starting delimiter
2. (?:^) => Matches start of line in a non-capturing group (you could just use ^; I was using |\n originally and didn't update)
3. ( => Start a capturing group
4. (?: => Start a non-capturing group
5. (?:[\w]+) => A non-capturing group to match one or more word characters a-zA-Z0-9_ (using a character class so that you can add to it if you need to)
6. (?:, ?|$) => A non-capturing group to match either a comma (with an optional space) or the end of the string/line
7. )+ => End the non-capturing group (4) and repeat 5/6 to find multiple matches in the line
8. ) => Close the capture group (3)
9. # => Ending delimiter
10. m => Multi-line modifier
Follow up from number 2:
#^((?:(?:[\w]+)(?:, ?|$))+)#m
Counting keywords
Having now returned an array of lines only containing key words you can count the number of commas and thus get the number of keywords
$key_words = implode(', ', $matches[1]); // Join lines returned by preg_match_all
echo substr_count($key_words, ','); // 8
N.B. In most circumstances this will return NUMBER_OF_KEY_WORDS - 1 (i.e. in your case 7); it returns 8 because you have a comma at the end of your first line of key words.
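As a rough cross-check of the approach above, here is a Python sketch that matches keyword-only lines with a multi-line regex and then counts the keywords themselves rather than the commas. It is an illustration, not a port of the PHP: the sample text mirrors the question, the variable names are invented, and the character class `[\w.]` is an assumption that lets the trailing dots through.

```python
import re

text = (
    "[random text, with commas, dots and all]\n"
    "keyword1, keyword2, keyword3, keyword4, keyword5,\n"
    "Keyword6, keyword7, keyword8..."
)

# A line qualifies when, from its start, it consists only of word/dot tokens
# each followed by a comma (optionally plus a space) or the end of the line.
line_pattern = re.compile(r"^(?:[\w.]+(?:, ?|$))+", re.MULTILINE)

keyword_lines = [m.group(0) for m in line_pattern.finditer(text)]

# Split every matched line on commas and drop the empty pieces left by
# trailing commas, so the count does not depend on how a line ends.
keywords = [
    part.strip()
    for line in keyword_lines
    for part in line.split(",")
    if part.strip()
]

print(len(keywords))
```

Counting the non-empty pieces avoids the off-by-one that comes from counting separators. Note that a line consisting of a single ordinary word would also qualify, so this remains a heuristic.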
Links
http://php.net/manual/en/reference.pcre.pattern.modifiers.php
http://www.regular-expressions.info/
http://php.net/substr_count
Why not just use explode and trim?
$keywords = array_map ('trim', explode (',', $keywordstring));
Then do a count() on $keywords.
If you think keywords with spaces in them are spam, then you can iterate over the $keywords array and look for any that contain whitespace. There might be legitimate reasons for having spaces in a keyword though. If you're talking about superheroes on your system, for example, someone might enter The Tick or Iron Man as a keyword.
I don't think counting keywords and looking for spaces in keywords are really very good strategies for detecting spam though. You might want to look into other bot protection strategies instead, or even use manual moderation.
How to match on the String of text between the commas?
This SO post was marked as a duplicate of my question. It is not a duplicate, and none of the answers here covered how to also match on the strings between the commas, so see below for how to take this a step further.
How to Match on single digit values in a CSV String
For example, suppose the task is to search the comma-separated values for a single 7, 8 or 9, without matching multi-digit combinations such as 17, 77 or 78.
The answer is to use lookarounds and place the search pattern between them:
(?<=^|,)[789](?=,|$)
See live demo.
The pattern above is the more concise one, but here are both patterns that were provided as solutions for matching strings within the commas:
(?<=^|,)[789](?=,|$) Provided by #Bohemian and chosen as the Correct Answer
(?:(?<=^)|(?<=,))[789](?:(?=,)|(?=$)) Provided in comments by #Ouroborus
Demo: https://regex101.com/r/fd5GnD/1
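A short Python sketch of the single-digit match follows. Python's built-in re module requires fixed-width lookbehind, so it rejects (?<=^|,); this sketch therefore uses the equivalent negated form, "not preceded by a non-comma" and "not followed by a non-comma". The sample CSV string is invented for illustration.

```python
import re

# (?<![^,])  start of string, or the previous character is a comma
# [789]      a single 7, 8 or 9
# (?![^,])   end of string, or the next character is a comma
single_digit = re.compile(r"(?<![^,])[789](?![^,])")

csv_row = "7,17,8,78,9"  # hypothetical sample row
hits = single_digit.findall(csv_row)
print(hits)
```

Only the stand-alone fields 7, 8 and 9 are returned; the digits inside 17 and 78 are skipped because a non-comma character sits next to them.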
Your first regexp doesn't need a preceding comma
[\w\s]+[,-]
A regex that will match strings between two commas or start or end of string is
(?<=,|^)[^,]*(?=,|$)
Or, a bit more efficient:
(?<![^,])[^,]*(?![^,])
See the regex demo #1 and demo #2.
Details:
(?<=,|^) / (?<![^,]) - start of string or a position immediately preceded with a comma
[^,]* - zero or more chars other than a comma
(?=,|$) / (?![^,]) - end of string or a position immediately followed with a comma
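The second pattern can be tried directly in Python, whose re module does not accept the variable-width lookbehind used in the first one. This is a minimal sketch with a made-up input string; it shows that empty fields between adjacent commas are returned as empty strings.

```python
import re

# Each match starts at the start of the string or right after a comma, and
# ends at the end of the string or right before a comma.
field = re.compile(r"(?<![^,])[^,]*(?![^,])")

row = "a,b,,c"  # hypothetical input with one empty field
fields = field.findall(row)
print(fields)
```

Because `[^,]*` may match zero characters, the empty field is kept; switching to `[^,]+` would drop it.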
If people still search for this in 2021
([^,\n])+
Match anything except new line and comma
regexr.com/60eme
I think the difficulty is that the random text can also contain commas.
If the keywords are all on one line and it is the last line of the text as a whole, trim the whole text removing new line characters from the end. Then take the text from the last new line character to the end. This should be your string containing the keywords. Once you have this part singled out, you can explode the string on comma and count the parts.
<?php
$string = " some gibberish, some more gibberish, and random text
keyword1, keyword2, keyword3
";
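$trimmed = trim($string); // trim once so the offset below refers to the same string
$lastEOL = strrpos($trimmed, PHP_EOL);
$keywordLine = substr($trimmed, $lastEOL + strlen(PHP_EOL)); // text after the last line break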
$keywords = explode(',', $keywordLine);
echo "Number of keywords: " . count($keywords);
I know it is not a regex, but I hope it helps nevertheless.
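The same "take the last line, then split on commas" idea can be sketched in Python as follows. The sample text mirrors the PHP example and the variable names are illustrative only.

```python
# Sample input: free text on the first line, keywords on the last line.
text = (
    " some gibberish, some more gibberish, and random text\n"
    "keyword1, keyword2, keyword3\n"
)

# Strip surrounding whitespace first so the trailing newline does not
# produce an empty last line, then keep only the final line.
last_line = text.strip().splitlines()[-1]

# Split on commas and discard empty pieces (e.g. from a trailing comma).
keywords = [k.strip() for k in last_line.split(",") if k.strip()]

print("Number of keywords:", len(keywords))
```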
The only way to find a solution is to find something that separates the random text from the keywords and that does not occur within the keywords. If the keywords themselves can contain a new line, you cannot use a single new line. But what about 2 consecutive new lines, or some other sequence of characters?
$string = " some gibberish, some more gibberish, and random text

keyword1, keyword2, keyword3,
keyword4, keyword5, keyword6,
keyword7, keyword8, keyword9
";
$trimmed = trim($string); // trim once so the offset below refers to the same string
$lastEOL = strrpos($trimmed, PHP_EOL . PHP_EOL); // 2 end of lines after random text
$keywordLine = substr($trimmed, $lastEOL);
$keywords = explode(',', $keywordLine);
echo "Number of keywords: " . count($keywords);
(edit: added example for more new lines - long shot)

python 3 regex string matching ignore whitespace and string.punctuation

I am new to regex and would like to know how to pattern match two strings. The use case would be something like finding a certain phrase in some text. I'm using python 3.7 if that makes a difference.
phrase = "some phrase" #the phrase I'm searching for
Possible matches:
text = "some##$#phrase"
^^^^ #non-alphanumeric can be treated like a single space
text = "some phrase"
text = "!!!some!!! phrase!!!"
These are not matches:
text = "some phrases"
^ #the 's' on the end makes it false
text = "ssome phrase"
text = "some other phrase"
I have tried using something like:
re.search(r'\b'+phrase+'\b', text)
I would very much appreciate an explanation of why the regex works if you provide a valid solution.
You should use something like this:
re.search(r'\bsome\W+phrase\b', text)
'\W' means non-word character
'+' means one or more times
If the phrase is stored in a variable, you can build the pattern from it first:
some_phrase = some_phrase.replace(r' ', r'\W+')
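Putting the pieces together, here is a hedged sketch that builds the pattern from a phrase and checks it against the examples in the question. The `phrase_pattern` helper is a hypothetical name, and `re.escape` is added on the assumption that the phrase should be matched literally. The original attempt failed partly because the second `'\b'` was not a raw string, so Python read it as a backspace character rather than a word boundary.

```python
import re

def phrase_pattern(phrase):
    """Compile a pattern matching the phrase as whole words, allowing any
    run of non-word characters between the words."""
    words = phrase.split()
    body = r"\W+".join(re.escape(word) for word in words)
    # Raw strings keep \b as a regex word boundary.
    return re.compile(r"\b" + body + r"\b")

pattern = phrase_pattern("some phrase")

should_match = ["some##$#phrase", "some phrase", "!!!some!!! phrase!!!"]
should_not_match = ["some phrases", "ssome phrase", "some other phrase"]

for text in should_match + should_not_match:
    print(repr(text), bool(pattern.search(text)))
```

One caveat: the underscore counts as a word character, so "some_phrase" would not match with this pattern.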

How to gsub on the text between two words in R?

EDIT:
I would like to place a \n before a specific unknown word in my text. I know that the first time the unknown word appears in my text will be between "Tree" and "Lake"
Ex. of text:
text
[1] "TreeRULakeSunWater"
[2] "A B C D"
EDIT:
"Tree" and "Lake" will never change, but the word in between them is always changing so I do not look for "RU" in my regex
What I am currently doing:
if (grepl(".*Tree\\s*|Lake.*", text)) { text <- gsub(".*Tree\\s*|Lake.*", "\n\\1", text)}
The problem with what I am doing above is that the gsub will sub all of text and leave just \nRU.
text
[1] "\nRU"
I have also tried:
if (grepl(".*Tree *(.*?) *Lake.*", text)) { text <- gsub(".*Tree *(.*?) *Lake.*", "\n\\1", text)}
What I would like text to look like after gsub:
text
[1] "Tree \nRU LakeSunWater"
[2] "A B C D"
EDIT:
From Wiktor Stribizew's comment I am able to do a successful gsub
gsub("Tree(\\w+)Lake", "Tree \n\\1 Lake", text)
But this will only do a gsub on occurrences where "RU" is between "Tree" and "Lake", which is the first occurrence of the unknown word. The unknown word, in this case "RU", will show up many times in the text, and I would like to place \n in front of every occurrence of "RU" when "RU" is a whole word.
New Ex. of text.
text
[1] "TreeRULakeSunWater"
[2] "A B C RU D"
New Ex. of what I would like:
text
[1] "Tree \nRU LakeSunWater"
[2] "A B C \nRU D"
Any help will be appreciated. Please let me know if further information is needed.
You need to find the unknown word between "Tree" and "Lake" first. You can use
unknown_word <- gsub(".*Tree(\\w+)Lake.*", "\\1", text)
The pattern matches any characters up to the last Tree in a string, then captures the unknown word (\w+ = one or more word characters) up to the Lake and then matches the rest of the string. It replaces all the strings in the vector. You can access the first one by [[1]] index.
Then, when you know the word, replace it with
gsub(paste0("[[:space:]]*(", unknown_word[[1]], ")[[:space:]]*"), " \n\\1 ", text)
See IDEONE demo.
Here, you have [[:space:]]*( + unknown_word[1] + )[[:space:]]* pattern. It matches zero or more whitespaces on both ends of the unknown word, and the unknown word itself (captured into Group 1). In the replacement, the spaces are shrunk into 1 (or added if there were none) and then \\1 restores the unknown word. You may replace [[:space:]] with \\s.
UPDATE
If you only need to add a newline before occurrences of RU that are whole words, use the \b word boundary:
> gsub(paste0("[[:space:]]*\\b(", unknown_word[[1]], ")\\b[[:space:]]*"), " \n\\1 ", text)
[1] "TreeRULakeSunWater" "A B C \nRU D"
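The two-step approach (first discover the unknown word, then replace its whole-word occurrences) can be sketched in Python as well. This is an illustration rather than a translation of the R call; the variable names are invented and `re.escape` is an added precaution in case the discovered word contains regex metacharacters.

```python
import re

texts = ["TreeRULakeSunWater", "A B C RU D"]

# Step 1: capture the unknown word between "Tree" and "Lake".
found = re.search(r"Tree(\w+)Lake", texts[0])
unknown_word = found.group(1)

# Step 2: match the word only where it stands alone (\b on both sides),
# together with any surrounding whitespace.
whole_word = re.compile(r"\s*\b" + re.escape(unknown_word) + r"\b\s*")

# A function replacement sidesteps backslash handling in the replacement.
result = [whole_word.sub(lambda m: " \n" + unknown_word + " ", t) for t in texts]

print(unknown_word)
print(result)
```

As in the R output above, the first string is left unchanged because "RU" is embedded in a longer word there.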

How to Extract a substring that matches a Particular Regular expression match from a String in R

I am trying to write a function that returns all the substrings of a string that match a regular expression, for example:
str <- "hello Brother How are you"
I want to extract all the substrings of str that match this regular expression - "[A-z]+ [A-z]+"
which results in -
"hello Brother"
"Brother How"
"How are"
"are you"
Is there a library function that can do that?
You can do it with stringr library str_match_all function and the method Tim Pietzcker described in his answer (capturing inside an unanchored positive lookahead):
> library(stringr)
> str <- "hello Brother How are you"
> res <- str_match_all(str, "(?=\\b([[:alpha:]]+ [[:alpha:]]+))")
> l <- unlist(res)
> l[l != ""]
## [1] "hello Brother" "Brother How" "How are" "are you"
Or to only get unique values:
> unique(l[l != ""])
##[1] "hello Brother" "Brother How" "How are" "are you"
I just advise using [[:alpha:]] instead of [A-z], since [A-z] matches more than just letters (it also covers the characters [, \, ], ^, _ and `, which sit between Z and a in ASCII).
Regex matches "consume" the text they match, therefore (generally) the same bit of text can't match twice. But there are constructs called lookaround assertions which don't consume the text they match, and which may contain capturing groups.
That makes your endeavor possible (although you can't use [A-z], that doesn't do what you think it does):
(?=\b([A-Za-z]+ [A-Za-z]+))
will match as expected; you need to look at group 1 of the match result, not the matched text itself (which will always be empty).
The \b word boundary anchor is necessary to ensure that our matches always start at the beginning of a word (otherwise you'd also have the results "ello Brother", "llo Brother", "lo Brother", and "o Brother").
Test it live on regex101.com.
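For readers who want to try the capturing-lookahead technique outside R, here is a minimal Python sketch. `re.findall` returns the contents of group 1 when the pattern has exactly one capturing group, which is what we need here since the overall match is always empty.

```python
import re

text = "hello Brother How are you"

# The lookahead consumes nothing, so the engine can start a new attempt at
# every word boundary; the capturing group inside it records the word pair.
pairs = re.findall(r"(?=\b([A-Za-z]+ [A-Za-z]+))", text)

print(pairs)
```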

Regex Valid Twitter Mention

I'm trying to find a regex that matches if a Tweet is a true mention. To be a mention, the string can't start with "#" and can't contain "RT" (case insensitive), and "#" must start the word.
In the examples I commented the desired output
Some examples:
function search($strings, $regexp) {
$regexp;
foreach ($strings as $string) {
echo "Sentence: \"$string\" <- " .
(preg_match($regexp, $string) ? "MATCH" : "NO MATCH") . "\n";
}
}
$strings = array(
"Hi #peter, I like your car ", // <- MATCH
"#peter I don't think so!", //<- NO MATCH: the string it's starting with # it's a reply
"Helo!! :# how are you!", // NO MATCH <- it's not a word, we need #(word)
"Yes #peter i'll eat them this evening! RT #peter: hey #you, do you want your pancakes?", // <- NO MATCH "RT/rt" on the string , it's a RT
"Helo!! ineed#aser.com how are you!", //<- NO MATCH, it doesn't start with #
"#peter is the best friend you could imagine. RT #juliet: #you do you know if #peter it's awesome?" // <- NO MATCH starting with # it's a reply and RT
);
echo "Example 1:\n";
search($strings, "/(?:[[:space:]]|^)#/i");
Current output:
Example 1:
Sentence: "Hi #peter, I like your car " <- MATCH
Sentence: "#peter I don't think so!" <- MATCH
Sentence: "Helo!! :# how are you!" <- NO MATCH
Sentence: "Yes #peter i'll eat them this evening! RT #peter: hey #you, do you want your pancakes?" <- MATCH
Sentence: "Helo!! ineed#aser.com how are you!" <- MATCH
Sentence: "#peter is the best friend you could imagine. RT #juliet: #you do you know if #peter it's awesome?" <- MATCH
EDIT:
I need it in regex because it can be used in MySQL and other
languages too. I am not looking for any username. I only want to know
if the string is a mention or not.
This regexp might work a bit better: /\B\#([\w\-]+)/gim
Here's a jsFiddle example of it in action: http://jsfiddle.net/2TQsx/96/
Here's a regex that should work:
/^(?!.*\bRT\b)(?:.+\s)?#\w+/i
Explanation:
/^ //start of the string
(?!.*\bRT\b) //Verify that rt is not in the string.
(?:.+\s)? //Optionally match any chars followed by whitespace.
//Note: (?: ) makes the group non-capturing.
#\w+ //Find # followed by one or more word chars.
/i //Make it case insensitive.
I have found that this is the best way to find mentions inside of a string in JavaScript. I don't know exactly how I would handle the RTs, but I think this might help with part of the problem.
var str = "#jpotts18 what is up man? Are you hanging out with #kyle_clegg";
var pattern = /#[A-Za-z0-9_-]*/g;
str.match(pattern);
["#jpotts18", "#kyle_clegg"]
I guess something like this will do it:
^(?!.*?RT\s).+\s#\w+
Roughly translated to:
At the beginning of string, look ahead to see that RT\s is not present, then find one or more of characters followed by a # and at least one letter, digit or underscore.
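To see how this pattern behaves on the sample sentences from the question, here is a small Python check. It is only a sketch: the pattern is used exactly as written (so it is case sensitive), the sentences are copied from the question, and the variable names are made up.

```python
import re

# ^            anchor at the start of the string
# (?!.*?RT\s)  fail if "RT" followed by whitespace appears anywhere
# .+\s#\w+     require some text, whitespace, then # and word characters
mention = re.compile(r"^(?!.*?RT\s).+\s#\w+")

sentences = [
    "Hi #peter, I like your car ",
    "#peter I don't think so!",
    "Helo!! :# how are you!",
    "Yes #peter i'll eat them this evening! RT #peter: hey #you, do you want your pancakes?",
    "Helo!! ineed#aser.com how are you!",
    "#peter is the best friend you could imagine. RT #juliet: #you do you know if #peter it's awesome?",
]

results = [bool(mention.search(s)) for s in sentences]

for sentence, matched in zip(sentences, results):
    print("MATCH" if matched else "NO MATCH", "<-", sentence)
```

Only the first sentence matches, which agrees with the desired output listed in the question. Adding a case-insensitive flag would need care, because "rt" followed by whitespace also appears at the end of ordinary words such as "start".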
Twitter has published the regex they use in their twitter-text library. They have other language versions posted as well on GitHub.
A simple pattern that works correctly even if the scraping tool has sometimes appended special characters: (?<![\w])#[\S]*\b. This worked for me.