Regex Valid Twitter Mention - regex

I'm trying to find a regex that matches only if a tweet is a true mention. To be a mention, the string can't start with "#", can't contain "RT" (case insensitive), and the "#" must start a word.
In the examples, I have commented the desired output.
Some examples:
function search($strings, $regexp) {
foreach ($strings as $string) {
echo "Sentence: \"$string\" <- " .
(preg_match($regexp, $string) ? "MATCH" : "NO MATCH") . "\n";
}
}
$strings = array(
"Hi #peter, I like your car ", // <- MATCH
"#peter I don't think so!", //<- NO MATCH: the string it's starting with # it's a reply
"Helo!! :# how are you!", // NO MATCH <- it's not a word, we need #(word)
"Yes #peter i'll eat them this evening! RT #peter: hey #you, do you want your pancakes?", // <- NO MATCH "RT/rt" on the string , it's a RT
"Helo!! ineed#aser.com how are you!", //<- NO MATCH, it doesn't start with #
"#peter is the best friend you could imagine. RT #juliet: #you do you know if #peter it's awesome?" // <- NO MATCH starting with # it's a reply and RT
);
echo "Example 1:\n";
search($strings, "/(?:[[:space:]]|^)#/i");
Current output:
Example 1:
Sentence: "Hi #peter, I like your car " <- MATCH
Sentence: "#peter I don't think so!" <- MATCH
Sentence: "Helo!! :# how are you!" <- NO MATCH
Sentence: "Yes #peter i'll eat them this evening! RT #peter: hey #you, do you want your pancakes?" <- MATCH
Sentence: "Helo!! ineed#aser.com how are you!" <- MATCH
Sentence: "#peter is the best friend you could imagine. RT #juliet: #you do you know if #peter it's awesome?" <- MATCH
EDIT:
I need it as a regex because it may be used in MySQL and other
languages too. I am not looking for any username. I only want to know
whether the string is a mention or not.

This regexp might work a bit better: /\B\#([\w\-]+)/gim
Here's a jsFiddle example of it in action: http://jsfiddle.net/2TQsx/96/

Here's a regex that should work:
/^(?!.*\bRT\b)(?:.+\s)?#\w+/i
Explanation:
/^ //start of the string
(?!.*\bRT\b) //Verify that rt is not in the string.
(?:.+\s)? //Optionally match any characters followed by one whitespace character.
//Note: (?: ) makes the group non-capturing.
#\w+ //Find # followed by one or more word chars.
/i //Make it case insensitive.
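As a quick check, here is a minimal JavaScript sketch that runs the pattern above against the six sample strings from the question. The variable names are only illustrative. Note one result that differs from the desired output: because the prefix group is optional, a string that starts with # (the second sample) still matches.

```javascript
// Pattern copied verbatim from the answer above.
const mentionPattern = /^(?!.*\bRT\b)(?:.+\s)?#\w+/i;

// The six sample strings from the question.
const samples = [
  "Hi #peter, I like your car ",
  "#peter I don't think so!",
  "Helo!! :# how are you!",
  "Yes #peter i'll eat them this evening! RT #peter: hey #you, do you want your pancakes?",
  "Helo!! ineed#aser.com how are you!",
  "#peter is the best friend you could imagine. RT #juliet: #you do you know if #peter it's awesome?",
];

// No "g" flag is used, so test() keeps no state between calls.
const results = samples.map((s) => mentionPattern.test(s));

samples.forEach((s, i) => {
  console.log(`${results[i] ? "MATCH   " : "NO MATCH"} <- "${s}"`);
});
// results is [true, true, false, false, false, false]
```

If the optional marker is dropped, i.e. `.+\s` instead of `(?:.+\s)?`, at least one character and a whitespace are required before the #, which rejects the second sample as well.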

I have found that this is the best way to find mentions inside a string in JavaScript. I don't know exactly how I would handle the RTs, but I think this might help with part of the problem.
var str = "#jpotts18 what is up man? Are you hanging out with #kyle_clegg";
var pattern = /#[A-Za-z0-9_-]*/g;
str.match(pattern);
["#jpotts18", "#kyle_clegg"]

I guess something like this will do it:
^(?!.*?RT\s).+\s#\w+
Roughly translated to:
At the beginning of string, look ahead to see that RT\s is not present, then find one or more of characters followed by a # and at least one letter, digit or underscore.
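A small JavaScript sketch of how this pattern behaves, under the assumption that it is used exactly as written (no flags). The sample strings other than the first two are made up for illustration. It also shows why a word boundary around RT can matter once the i flag is added.

```javascript
// Pattern from the answer above, used without any flags.
const strict = /^(?!.*?RT\s).+\s#\w+/;

console.log(strict.test("Hi #peter, I like your car ")); // true: plain mention
console.log(strict.test("#peter I don't think so!"));    // false: starts with #
console.log(strict.test("Yes #peter! RT #peter: hey"));  // false: contains "RT "
console.log(strict.test("rt #peter: hey #you"));         // true: lowercase "rt" is not seen

// Adding the i flag catches lowercase "rt", but without a word boundary
// the "rt " at the end of "concert " is also treated as a retweet marker.
const insensitive = /^(?!.*?RT\s).+\s#\w+/i;
console.log(insensitive.test("Nice concert #peter"));    // false: false negative

// Hypothetical variant using \bRT\b so that only the whole word is rejected.
const bounded = /^(?!.*\bRT\b).+\s#\w+/i;
console.log(bounded.test("Nice concert #peter"));        // true
console.log(bounded.test("rt #peter: hey #you"));        // false
```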

Twitter has published the regex they use in their twitter-text library. They have other language versions posted as well on GitHub.

A simple pattern that still works correctly when the scraping tool has appended special characters: (?<![\w])#[\S]*\b. This worked for me.

Related

gsub with exception in R

I'm removing English characters from Hebrew text but would like to keep a short list of English words that I want, e.g. words2keep <- c("ok", "hello", "yes*").
My current regex is text <- gsub("[A-Z,a-z]", "", text), but the question is how to add the exception so that it does not remove all English words.
Reproducible example:
text = "ok אני מסכים איתך Yossi Cohen"
after gsub with exception
text = "ok אני מסכים איתך"
Thank you for all suggestions
This is a tricky one. I think we can do it by matching against whole words by making use of the \b word boundary assertion, and at the same time include a negative lookahead assertion just prior to the match which rejects the words (again, whole words) that you want to blacklist for removal (or equivalently whitelist for preservation). This appears to be working:
gsub(perl=T,paste0('(?!\\b',paste(collapse='\\b|\\b',words2keep),'\\b)\\b[A-Za-z]+\\b'),'',text);
[1] "ok אני מסכים איתך "
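The same idea (a negative lookahead that protects whole whitelisted words) can be sketched in JavaScript. This is only an illustration of the technique, not a translation of the R call: the names are made up, the whitelist holds plain words rather than regex fragments such as "yes*", and the leftover whitespace is collapsed in a second step.

```javascript
// Words to preserve (plain words; assumed to contain no regex metacharacters).
const wordsToKeep = ["ok", "hello", "yes"];
const text = "ok אני מסכים איתך Yossi Cohen";

// \b(?!(?:ok|hello|yes)\b)[A-Za-z]+\b
// At a word start, refuse to match if a whole whitelisted word begins here;
// otherwise consume the run of ASCII letters.
const pattern = new RegExp(
  "\\b(?!(?:" + wordsToKeep.join("|") + ")\\b)[A-Za-z]+\\b",
  "g"
);

const cleaned = text
  .replace(pattern, "")   // drop the non-whitelisted English words
  .replace(/\s+/g, " ")   // collapse the gaps they leave behind
  .trim();

console.log(cleaned);
```

Because the lookahead ends with \b, a longer word that merely starts with a whitelisted word (for example "okay") is still removed.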
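Use gsub with the pattern [A-Z].*: it matches from the first uppercase letter (A to Z) through the end of the string, so everything from that point on is removed. This only works when the text to remove starts with an uppercase letter and nothing after it needs to be kept.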
gsub("[A-Z].*","",text)
[1] "ok אני מסכים איתך "
#data
text = "ok אני מסכים איתך Yossi Cohen"

How to Extract a substring that matches a Particular Regular expression match from a String in R

I am trying to write a function so that I can get all the substrings of a string that match a regular expression. Example:
str <- "hello Brother How are you"
I want to extract all the substrings from str , where those substrings matches this regular expression - "[A-z]+ [A-z]+"
which results in -
"hello Brother"
"Brother How"
"How are"
"are you"
is there any library function which can do that ?
You can do it with stringr library str_match_all function and the method Tim Pietzcker described in his answer (capturing inside an unanchored positive lookahead):
> library(stringr)
> str <- "hello Brother How are you"
> res <- str_match_all(str, "(?=\\b([[:alpha:]]+ [[:alpha:]]+))")
> l <- unlist(res)
> l[l != ""]
## [1] "hello Brother" "Brother How" "How are" "are you"
Or to only get unique values:
> unique(l[l != ""])
##[1] "hello Brother" "Brother How" "How are" "are you"
I advise using [[:alpha:]] instead of [A-z], since [A-z] matches more than just letters: it also covers the characters [, \, ], ^, _ and ` that sit between Z and a in ASCII.
Regex matches "consume" the text they match, therefore (generally) the same bit of text can't match twice. But there are constructs called lookaround assertions which don't consume the text they match, and which may contain capturing groups.
That makes your endeavor possible (although you can't use [A-z], that doesn't do what you think it does):
(?=\b([A-Za-z]+ [A-Za-z]+))
will match as expected; you need to look at group 1 of the match result, not the matched text itself (which will always be empty).
The \b word boundary anchor is necessary to ensure that our matches always start at the beginning of a word (otherwise you'd also have the results "ello Brother", "llo Brother", "lo Brother", and "o Brother").
Test it live on regex101.com.
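For comparison, a minimal JavaScript sketch of the same capture-inside-a-lookahead technique. It assumes an environment with String.prototype.matchAll; the variable names are illustrative.

```javascript
const str = "hello Brother How are you";

// The lookahead itself matches the empty string, so the engine moves on one
// position at a time and the captured word pairs are allowed to overlap.
const pairPattern = /(?=\b([A-Za-z]+ [A-Za-z]+))/g;

// Group 1 holds the text of interest; the match itself is always empty.
const pairs = Array.from(str.matchAll(pairPattern), (m) => m[1]);

console.log(pairs);
// ["hello Brother", "Brother How", "How are", "are you"]
```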

Exact string matching in r

I am struggling with exact string matching in R. I need to match the searched string in a sentence only when it is an exact match:
sentence2 <- "laptop is a great product"
words2 <- c("top","laptop")
I was trying something like this:
sub(paste(c("^",words2,"$")),"",sentence2)
I need to replace only the exact match (laptop) with an empty string, but it didn't work.
Please, could you help me. Thanks in advance.
Desired output:
is a great product
You can try:
gsub(paste0("^",words2," ",collapse="|"),"",sentence2)
#[1] "is a great product"
The result of paste0("^",words2," ",collapse="|") is "^top |^laptop " which means "either 'top' at the beginning of string followed by a space or 'laptop' at the beginning of string followed by a space".
If you want to match entire words, then you can use \\b to match word boundaries.
gsub(paste0('\\b', words2, '\\b', collapse='|'), '', sentence2)
## [1] " is a great product"
Add optional whitespace to the pattern if you want to replace the adjacent spaces as well.
gsub(paste0('\\s*\\b', words2, '\\b\\s*', collapse='|'), '', sentence2)
## [1] "is a great product"
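The word-boundary approach carries over to other regex flavours. Below is a rough JavaScript sketch of the last variant; the helper escapeRegExp and the variable names are assumptions made for this example, not part of the answers above.

```javascript
// Escape regex metacharacters so each word is matched literally.
function escapeRegExp(word) {
  return word.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

const sentence = "laptop is a great product";
const words = ["top", "laptop"];

// \s*\b(?:top|laptop)\b\s*  -> whole words only, plus adjacent whitespace.
const wholeWords = new RegExp(
  "\\s*\\b(?:" + words.map(escapeRegExp).join("|") + ")\\b\\s*",
  "g"
);

const result = sentence.replace(wholeWords, "");
console.log(result); // "is a great product"
```

The "top" inside "laptop" is not touched on its own, because there is no word boundary between "lap" and "top". One caveat of swallowing whitespace on both sides: removing a word from the middle of a sentence joins its neighbours together.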

Matching non commented pattern in eclipse

I am having trouble with a regex syntax.
I want to match all occurrences of a certain word followed by a number, but exclude lines which are commented.
Comments are (multiple) # or ## or ### ...
Examples:
#This is a comment <- no match
#This is a comment myword 8 <- no match
my $var = 'myword 12'; <- match
my $var2 = 'myword'; <- no match
Until now I have
original pattern: ^[^(\#+)](.*?)(myword \d+)(.*?)$
new pattern: ^([^\#]*?)(myword\s+\d+)(.*?)$
This should match lines which do not begin with one or more #, followed by something, then the word-number combination I am searching for, and finally something.
It would perhaps be good to match also parts of lines if the comment does not begin at the beginning of the line.
my $var3 = 'test';#myword 8 <- no match
What am I doing wrong?
I want to use it in Eclipse's file search (with Perl epic module).
Edit: The new pattern does not return false matches, but each match covers the line that includes myword plus several lines before it. And I'm not sure it returns all matches.
Note that [] denotes a character class. You cannot use quantifiers or groups inside one. Like the dot, a character class matches a single character (any one of those listed), and the class as a whole can then be quantified.
In your example, [^(\#+)] matches any single character except (, ), +, # and, depending on the flavour (I guess), \.
So what you want here is to match a line that starts with any character except for a #. (I think.)
A problem is that the # might occur in a string where it is not a comment. (Regarding comments not starting at the beginning of the line.)
Re: comments not at the beginning of the string.
To do this right (e.g. not to miss any valid matches) you pretty much have to parse a file's specific programming language's grammar properly, so you can't do this (easily, or even at all) with a RegEx.
If you don't, you risk missing valid search hits that follow a "#" used in a context other than comment start - as an example common to pretty much any language, after a string "this is my #hash".
It's even worse in Perl, where "#" can also appear as a regex delimiter, in $#myArr (the index of the last element of an array), or - joy of joys - as a valid character in an identifier name!
Of course, if you are aware of these problems and still want to use a regexp to extract the content, something like this may be useful:
^[^\#].[^\n\#]+myword\s\d+.[$;]+
This is a little bit complex, but I hope it works for you.
For me this matches as below:
my $var = 'myword 12'; <- match
my $var = 'myword 17'; <- match
my $var2 = 'myword'; <- no match
my $var = 'myword 9'; #'myword 17'; <- partly match
my $var = 'myword 8'; ##'myword 127'; <- partly match
my $var = ;#'myword 17'; <- no match
#my $var = 'myword 13'; <- no match
##my $var2 = 'myword 14'; <- no match
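To illustrate the simpler "no # before the hit on the same line" idea, here is a hedged JavaScript sketch. It is not the Eclipse search itself, and it keeps the limitation discussed above: a # inside a string literal is treated as the start of a comment. Excluding \n from the negated class keeps a match from spanning several lines, which is a likely cause of the multi-line matches described in the question's edit.

```javascript
const source = [
  "#This is a comment",
  "#This is a comment myword 8",
  "my $var = 'myword 12';",
  "my $var2 = 'myword';",
  "my $var3 = 'test';#myword 8",
].join("\n");

// ^         start of a line (m flag)
// [^#\n]*?  anything except # or a line break, so the scan stays on one line
// ( ... )   the word followed by a number
const uncommented = /^[^#\n]*?(myword\s+\d+)/gm;

const hits = Array.from(source.matchAll(uncommented), (m) => m[1]);
console.log(hits); // ["myword 12"]
```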

Regex for quoted string with escaping quotes

How do I get the substring " It's big \"problem " using a regular expression?
s = ' function(){ return " It\'s big \"problem "; }';
/"(?:[^"\\]|\\.)*"/
Works in The Regex Coach and PCRE Workbench.
Example of test in JavaScript:
var s = ' function(){ return " Is big \\"problem\\", \\no? "; }';
var m = s.match(/"(?:[^"\\]|\\.)*"/);
if (m != null)
alert(m);
This one comes from nanorc.sample, available in many Linux distros. It is used for syntax highlighting of C-style strings:
\"(\\.|[^\"])*\"
As provided by ePharaoh, the answer is
/"([^"\\]*(\\.[^"\\]*)*)"/
To have the above apply to either single quoted or double quoted strings, use
/"([^"\\]*(\\.[^"\\]*)*)"|\'([^\'\\]*(\\.[^\'\\]*)*)\'/
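A short JavaScript sketch of the combined pattern in use. The input string is invented for the example, and String.raw is used only so that the backslashes in the sample survive as literal characters.

```javascript
// Combined pattern from above: double-quoted OR single-quoted strings,
// each allowing backslash-escaped characters inside.
const quoted = /"([^"\\]*(\\.[^"\\]*)*)"|'([^'\\]*(\\.[^'\\]*)*)'/g;

// String.raw keeps every backslash exactly as typed.
const input = String.raw`say "a \"b\" c" and 'it\'s'`;

const found = input.match(quoted);
console.log(found);
// Two matches: the double-quoted string (escaped quotes included)
// and the single-quoted string (escaped apostrophe included).
```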
Most of the solutions provided here use alternation inside a repetition, i.e. (A|B)*.
You may encounter stack overflows on large inputs, since some regex engines implement this kind of repetition using recursion.
Java for instance: http://bugs.java.com/bugdatabase/view_bug.do?bug_id=6337993
Something like this:
"(?:[^"\\]*(?:\\.)?)*", or the one provided by Guy Bedford will reduce the amount of parsing steps avoiding most stack overflows.
/(["\']).*?(?<!\\)(\\\\)*\1/is
should work with any quoted string
"(?:\\"|.)*?"
Alternating the \" and the . passes over escaped quotes while the lazy quantifier *? ensures that you don't go past the end of the quoted string. Works with .NET Framework RE classes
/"(?:[^"\\]++|\\.)*+"/
Taken straight from man perlre on a Linux system with Perl 5.22.0 installed.
As an optimization, this regex uses the 'possessive' form of both + and * to prevent backtracking, for it is known beforehand that a string without a closing quote wouldn't match in any case.
This one works perfectly on PCRE and does not fail with a stack overflow.
"(.*?[^\\])??((\\\\)+)?+"
Explanation:
1. Every quoted string starts with the character ".
2. It may contain any number of any characters: .*? {Lazy match}; ending with a non-escape character [^\\].
3. Statement (2) is Lazy(!) optional because the string can be empty (""). So: (.*?[^\\])??
4. Finally, every quoted string ends with the character ", but it can be preceded by any number of escape sign pairs (\\\\)+; and this part is Greedy(!) optional: ((\\\\)+)?+ {Greedy matching}, because the string can be empty or have no ending pairs!
An option that has not been touched on before is:
Reverse the string.
Perform the matching on the reversed string.
Re-reverse the matched strings.
This has the added bonus of being able to correctly match escaped open tags.
Let's say you had the following string: String \"this "should" NOT match\" and "this \"should\" match"
Here, \"this "should" NOT match\" should not be matched and "should" should be.
On top of that this \"should\" match should be matched and \"should\" should not.
First an example.
// The input string.
const myString = 'String \\"this "should" NOT match\\" and "this \\"should\\" match"';
// The RegExp.
const regExp = new RegExp(
// Match close
'([\'"])(?!(?:[\\\\]{2})*[\\\\](?![\\\\]))' +
'((?:' +
// Match escaped close quote
'(?:\\1(?=(?:[\\\\]{2})*[\\\\](?![\\\\])))|' +
// Match everything thats not the close quote
'(?:(?!\\1).)' +
'){0,})' +
// Match open
'(\\1)(?!(?:[\\\\]{2})*[\\\\](?![\\\\]))',
'g'
);
// Reverse the matched strings.
const matches = myString
// Reverse the string.
.split('').reverse().join('')
// '"hctam "\dluohs"\ siht" dna "\hctam TON "dluohs" siht"\ gnirtS'
// Match the quoted
.match(regExp)
// ['"hctam "\dluohs"\ siht"', '"dluohs"']
// Reverse the matches
.map(x => x.split('').reverse().join(''))
// ['"this \"should\" match"', '"should"']
// Re order the matches
.reverse();
// ['"should"', '"this \"should\" match"']
Okay, now to explain the RegExp.
This regexp can easily be broken into three pieces, as follows:
# Part 1
(['"]) # Match a closing quotation mark " or '
(?! # As long as it's not followed by
(?:[\\]{2})* # A pair of escape characters
[\\] # and a single escape
(?![\\]) # As long as that's not followed by an escape
)
# Part 2
((?: # Match inside the quotes
(?: # Match option 1:
\1 # Match the closing quote
(?= # As long as it's followed by
(?:\\\\)* # A pair of escape characters
\\ # and a single escape
(?![\\]) # As long as that's not followed by an escape
) #
)| # OR
(?: # Match option 2:
(?!\1). # Any character that isn't the closing quote
)
)*) # Match the group 0 or more times
# Part 3
(\1) # Match an open quotation mark that is the same as the closing one
(?! # As long as it's not followed by
(?:[\\]{2})* # A pair of escape characters
[\\] # and a single escape
(?![\\]) # As long as that's not followed by an escape
)
This is probably a lot clearer in image form: generated using Jex's Regulex
Image on github (JavaScript Regular Expression Visualizer.)
Sorry, I don't have a high enough reputation to include images, so, it's just a link for now.
Here is a gist of an example function using this concept that's a little more advanced: https://gist.github.com/scagood/bd99371c072d49a4fee29d193252f5fc#file-matchquotes-js
Here is one that works with both " and ', and you can easily add others at the start.
("|')(?:\\\1|[^\1])*?\1
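It uses the backreference (\1) to match exactly what was captured by the first group (" or '). Note that in most regex flavours a backreference is not recognised inside a character class, so [^\1] does not actually mean "any character except the opening quote".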
http://www.regular-expressions.info/backref.html
One has to remember that regexps aren't a silver bullet for everything string-y. Some things are simpler to do with a cursor and linear, manual seeking. A CFL would do the trick pretty trivially, but there aren't many CFL implementations (afaik).
A more extensive version of https://stackoverflow.com/a/10786066/1794894
/"([^"\\]{50,}(\\.[^"\\]*)*)"|\'[^\'\\]{50,}(\\.[^\'\\]*)*\'|“[^”\\]{50,}(\\.[^“\\]*)*”/
This version also contains
Minimum quote length of 50
Extra type of quotes (open “ and close ”)
If it is searched from the beginning, maybe this can work?
\"((\\\")|[^\\])*\"
I faced a similar problem trying to remove quoted strings that may interfere with parsing of some files.
I ended up with a two-step solution that beats any convoluted regex you can come up with:
line = line.replace("\\\"","\'"); // Replace escaped quotes with something easier to handle
line = line.replaceAll("\"([^\"]*)\"","\"x\""); // Simple is beautiful
Easier to read and probably more efficient.
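The two-step idea is not tied to Java. A rough JavaScript sketch, with a made-up function name and sample line, could look like this:

```javascript
// Step 1: turn every escaped double quote (\") into a plain apostrophe.
// Step 2: with no escaped quotes left, a simple "[^"]*" finds each string.
function maskQuotedStrings(line) {
  const withoutEscapedQuotes = line.split('\\"').join("'");
  return withoutEscapedQuotes.replace(/"[^"]*"/g, '"x"');
}

// String.raw keeps the backslashes of the sample as literal characters.
const sampleLine = String.raw`return " It's big \"problem "; say("hi");`;
console.log(maskQuotedStrings(sampleLine));
// return "x"; say("x");
```

As in the original, this only handles \" and would be confused by an escaped backslash directly before a closing quote (\\"), so it is a trade-off of simplicity against full correctness.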
If your IDE is IntelliJ IDEA, you can avoid these headaches: store your regex in a String variable, and when you paste it inside the double quotes the IDE automatically escapes it into a valid string literal.
example in Java:
String s = "\"en_usa\":[^\\,\\}]+";
now you can use this variable in your regexp or anywhere.
(?<="|')(?:[^"\\]|\\.)*(?="|')
" It\'s big \"problem "
match result:
It\'s big \"problem
("|')(?:[^"\\]|\\.)*("|')
" It\'s big \"problem "
match result:
" It\'s big \"problem "
Messed around at regexpal and ended up with this regex: (Don't ask me how it works, I barely understand even tho I wrote it lol)
"(([^"\\]?(\\\\)?)|(\\")+)+"