Avoid Twitter profile names (#Profile) with regex

I am trying to analyze tweets but want to avoid the profile names that are prefixed with a # (#Profile_name), using regex!
I've tried:
re.findall(r'(?!#[\w+]*)(\w+)', "I want to take everything but #this, but I cannot find a way")
and it gives me:
['I', 'want', 'to', 'take', 'everything', 'but', 'this', 'but', 'I', 'cannot', 'find', 'a', 'way']
I don't want the "this" :/
I'm quite new in regex, but I really cannot solve this one.
Thanks!

Try re.sub
re.sub(pattern, repl, string, count=0, flags=0)
Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn’t found, string is returned unchanged. repl can be a string or a function; if it is a string, any backslash escapes in it are processed. That is, \n is converted to a single newline character, \r is converted to a carriage return, and so forth. Unknown escapes such as \j are left alone. Backreferences, such as \6, are replaced with the substring matched by group 6 in the pattern.
>>> re.sub(r'(#\w+)', "", "I want to take everything but #this, but I cannot find a way")
'I want to take everything but , but I cannot find a way'
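If you still want the individual words afterwards (as in your findall attempt), one approach is to strip the hashtag terms first and then pull out what is left; a quick sketch:
>>> import re
>>> cleaned = re.sub(r'#\w+', "", "I want to take everything but #this, but I cannot find a way")
>>> re.findall(r'\w+', cleaned)
['I', 'want', 'to', 'take', 'everything', 'but', 'but', 'I', 'cannot', 'find', 'a', 'way']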

Related

Shorten Regular Expression (\n) [duplicate]

I'd like to match three-character sequences of letters (only the letters 'a', 'b', 'c' are allowed), separated by commas (the last group is not followed by a comma).
Examples:
abc,bca,cbb
ccc,abc,aab,baa
bcb
I have written the following regular expression:
re.match('([abc][abc][abc],)+', "abc,defx,df")
However, it doesn't work correctly, as the examples below show:
>>> print bool(re.match('([abc][abc][abc],)+', "abc,defx,df")) # defx in second group
True
>>> print bool(re.match('([abc][abc][abc],)+', "axc,defx,df")) # 'x' in first group
False
It seems to only check the first group of three letters and ignores the rest. How do I write this regular expression correctly?
Try the following regex:
^[abc]{3}(,[abc]{3})*$
^...$  from the start to the end of the string
[...]  one of the given characters
...{3}  the preceding element repeated three times
(...)*  zero or more repetitions of the group in parentheses
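For instance, a quick interactive check of this pattern against one valid and one invalid string from the question:
>>> import re
>>> bool(re.match(r'^[abc]{3}(,[abc]{3})*$', "abc,bca,cbb"))
True
>>> bool(re.match(r'^[abc]{3}(,[abc]{3})*$', "abc,defx,df"))
False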
What you're asking your regex to find is "at least one triple of the letters a, b, c followed by a comma" - that's what the "+" gives you. Whatever follows after that doesn't matter to the regex. You might want to include "$", which means "end of the line", to be sure that the line consists entirely of allowed triples. However, in its current form your regex would also demand that the last triple ends in a comma, so you should explicitly code that it does not.
Try this:
re.match('([abc][abc][abc],)*([abc][abc][abc])$', "abc,bca,cbb")
This finds any number of allowed triples followed by a comma (maybe zero), then a triple without a comma, then the end of the line.
Edit: including the "^" (start of string) symbol is not necessary, because the match method already checks for a match only at the beginning of the string.
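A quick check of this pattern, using the strings from the question (one possible interactive session):
>>> import re
>>> bool(re.match('([abc][abc][abc],)*([abc][abc][abc])$', "abc,bca,cbb"))
True
>>> bool(re.match('([abc][abc][abc],)*([abc][abc][abc])$', "abc,defx,df"))
False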
The obligatory "you don't need a regex" solution:
all(letter in 'abc,' for letter in data) and all(len(item) == 3 for item in data.split(','))
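A quick check of that expression on a couple of strings (a sketch; data is just a throwaway variable):
>>> data = "ccc,abc,aab"
>>> all(letter in 'abc,' for letter in data) and all(len(item) == 3 for item in data.split(','))
True
>>> data = "abc,defx,df"
>>> all(letter in 'abc,' for letter in data) and all(len(item) == 3 for item in data.split(','))
False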
You need to iterate over the sequence of found values.
data_string = "abc,bca,df"
imatch = re.finditer(r'(?P<value>[abc]{3})(,|$)', data_string)
for match in imatch:
    print match.group('value')
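With the data_string = "abc,bca,df" from the snippet above, this loop prints the two triplets it can find (note that on its own it does not reject the invalid trailing "df"):
abc
bca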
So the regex to check whether the string matches the pattern will be:
data_string = "abc,bca,df"
match = re.match(r'^([abc]{3}(,|$))+$', data_string)
if match:
    print "data string is correct"
Your result is not surprising since the regular expression
([abc][abc][abc],)+
tries to match a string containing three characters from [abc] followed by a comma, one or more times, anywhere in the string. So the most important part is to make sure that there is nothing more in the string - as scessor suggests, by adding ^ (start of string) and $ (end of string) to the regular expression.
An alternative without using regex (albeit a brute force way):
>>> import itertools
>>> def matcher(x):
...     total = ["".join(p) for p in itertools.product(('a','b','c'), repeat=3)]
...     for i in x.split(','):
...         if i not in total:
...             return False
...     return True
>>> matcher("abc,bca,aaa")
True
>>> matcher("abc,bca,xyz")
False
>>> matcher("abc,aaa,bb")
False
If your aim is to validate a string as being composed of triplets of the letters a, b, and c:
for ss in ("abc,bbc,abb,baa,bbb",
           "acc",
           "abc,bbc,abb,bXa,bbb",
           "abc,bbc,ab,baa,bbb"):
    print ss, ' ', bool(re.match('([abc]{3},?)+\Z', ss))
Result:
abc,bbc,abb,baa,bbb True
acc True
abc,bbc,abb,bXa,bbb False
abc,bbc,ab,baa,bbb False
\Z means: the end of the string. Its presence forces the match to extend to the very end of the string.
By the way, I like Sonya's form too; in a way it is clearer:
bool(re.match('([abc]{3},)*[abc]{3}\Z',ss))
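One concrete difference between the two forms (not mentioned above) is that the {3},? version also accepts a trailing comma, while Sonya's version does not:
>>> import re
>>> bool(re.match(r'([abc]{3},?)+\Z', "abc,bca,"))
True
>>> bool(re.match(r'([abc]{3},)*[abc]{3}\Z', "abc,bca,"))
False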
To just repeat a sequence of patterns, you need to use a non-capturing group, a (?:...)-like construct, and apply a quantifier right after the closing parenthesis. The question mark and the colon after the opening parenthesis are the syntax that creates a non-capturing group (SO post).
For example:
(?:abc)+ matches strings like abc, abcabc, abcabcabc, etc.
(?:\d+\.){3} matches strings like 1.12.2., 000.00000.0., etc.
Here, you can use
^[abc]{3}(?:,[abc]{3})*$
(note the ?: right after the opening parenthesis, which makes the group non-capturing)
Note that using a capturing group is fraught with unwelcome effects in a lot of Python regex methods. See the classic issue described in the "re.findall behaves weird" post, for example: re.findall, and all other regex methods that use this function behind the scenes, only return the captured substrings if there is a capturing group in the pattern.
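For example, the difference is easy to see with a triplet pattern like the one in this question (a quick interactive sketch; the trailing comma in the test string is just there so the whole string matches):
>>> import re
>>> re.findall(r'([abc]{3},)+', "abc,bca,cbb,")      # capturing group: only the last repetition comes back
['cbb,']
>>> re.findall(r'(?:[abc]{3},)+', "abc,bca,cbb,")    # non-capturing group: the whole match comes back
['abc,bca,cbb,']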
In Pandas, it is also important to use non-capturing groups when you just need to group a pattern sequence: Series.str.contains will complain that "this pattern has match groups. To actually get the groups, use str.extract.", and Series.str.extract, Series.str.extractall and Series.str.findall behave like re.findall.
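As a quick illustration of the Pandas point (a sketch with a throwaway Series; the data is made up for the example):
>>> import pandas as pd
>>> s = pd.Series(["abc,bca", "xyz"])
>>> s.str.contains(r'(?:[abc]{3})')   # non-capturing group: no "match groups" warning
0     True
1    False
dtype: bool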

Regex that will check whether an expression is a valid arithmetic expression, for example 'a+b*c=d' is valid and accepted

I made separate regexes for identifiers and operators, but they are not giving the desired result. The check should look at the whole input string and return valid if it is valid, or invalid if it is invalid.
import re
identifiers = r'^[^\d\W]\w*\Z'
operators = r'[\+\*\-\/=]'
a = re.compile(identifiers, re.UNICODE)
b = re.compile(operators, re.UNICODE)
c = re.findall(a, 'a+b*c=d')
d = re.findall(b, 'a+b*c=d')
print c, 'identifiers'
print d, 'operators'
The result of this snippet is:
[] identifiers
['+', '*', '='] operators
I want the result to tell me whether the input string is valid or invalid, by checking every character of the input string against both regexes.
I think the issue you're having with your current code is that your identifiers pattern only works if it matches the whole string.
The problem is that the current pattern requires that both the beginning and the end of the input be matched (by the ^ and \Z respectively). That's why you're usually not finding any identifiers: only an input like "foo" would be matched, since it's a single identifier that touches both the start and the end of the string. (I'd also note that it is a bit odd to mix ^ and \Z together, though it is not invalid. It would be more natural to pair ^ with $, or \A with \Z.)
I suspect that you don't actually want ^ and \Z in your pattern, but rather should be using \b in place of both. The \b escape matches a "word boundary", which means either the start or end of the input, or a transition between word characters and non-word characters.
>>> re.findall(r'\b[^\d\W]\w*\b', 'a+b*c=d', re.U)
['a', 'b', 'c', 'd']
This still isn't going to do what you say you ultimately want (checking the string to ensure it's a valid expression). That's a much more difficult task, and regular expressions are not up to it in general. Certain specific forms of expressions can perhaps be matched with a regex, but supporting things like parentheses will break the whole approach in a hurry. To identify arbitrary arithmetic expressions, you'd need a more sophisticated parser, which might use regex in some of its steps, but not for the whole thing.
For simple cases like your example, an expression like this will work:
^[0-9a-z]+([+/*-][0-9a-z]+)+=[0-9a-z]+$
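For instance, a quick check of that pattern (the second string is just an arbitrary invalid example):
>>> import re
>>> bool(re.match(r'^[0-9a-z]+([+/*-][0-9a-z]+)+=[0-9a-z]+$', 'a+b*c=d'))
True
>>> bool(re.match(r'^[0-9a-z]+([+/*-][0-9a-z]+)+=[0-9a-z]+$', 'a++b=c'))
False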

Regex parse a command line string but don't return spaces between quotes

I am using python to parse a string that is passed in by the optparse module.
I want to split the string on certain delimiters but not in between quote marks.
A sample string is:
--state-basedir /dir/dir/dir/ --cmd=\"param load $v2param\" --master=/dev/ttyUSB0 --console --map --out=udp:192.168.1.1:14550
This string is passed in as a single optparse argument, I am then going to pass it to another process.
I have been trying various things at http://pythex.org/
The closest I have gotten is:
(?<!")[\s=](?![\s0-9a-zA-Z\$\\]*")
The issue is that the = sign after --cmd and the space before --master are not matched.
In plain English, this is how I am reading my regex:
match either a space character or an equals character, as long as it is not preceded by a quotation mark and as long as it is not followed by a combination of any other letters, numbers, punctuation and another quotation mark
I had a feeling that there was something else I was missing, like greediness, so I tried adding ? after my look-ahead and look-behind terms. If I put a ? after my look-behind one I can get the space before --master but if I put the ? after my look-ahead term I get the spaces in the quotation marks now, which I don't want.
The idea here is that I am going to use re.split to handle things.
Thanks for any explanations as to what I am doing wrong.
This is not a regex answer and it's also not pretty, but it is one line.
sum([[x] if '"' in x else re.split(' |=',x) for x in re.split('=(\".+?\" )',a)],[])
output:
['--state-basedir', '/dir/dir/dir/', '--cmd', '"param load $v2param" ', '--master', '/dev/ttyUSB0', '--console', '--map', '--out', 'udp:192.168.1.1:14550']
Starting from the inside, re.split('=(\".+?\" )', a) splits out text surrounded by quotes (more specifically ="something another thing" followed by a space). The split pieces are then split further with re.split(' |=', x) if they do not have a " in them, or are returned as-is ([x]) if they do. The last step collapses the resulting 2D list by overloading sum with sum(two_d_list, []).
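To make that reproducible, here is the same one-liner run end to end (assuming the backslashes in the sample are shell escaping, so the Python string a contains plain double quotes):
>>> import re
>>> a = '--state-basedir /dir/dir/dir/ --cmd="param load $v2param" --master=/dev/ttyUSB0 --console --map --out=udp:192.168.1.1:14550'
>>> sum([[x] if '"' in x else re.split(' |=',x) for x in re.split('=(".+?" )',a)],[])
['--state-basedir', '/dir/dir/dir/', '--cmd', '"param load $v2param" ', '--master', '/dev/ttyUSB0', '--console', '--map', '--out', 'udp:192.168.1.1:14550']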
I hope this answer helps but I understand if it isn't what you're looking for

Working with capture groups in Elixir

In the documentation I've not found the description of how to use capture groups in Elixir. How can I do that? Say, I want to extract a substring from a string and replace it with something else:
~r"\[tag1\](.+?)\[\/tag1\]"
How can I access the string in between ] [/?
Use Regex.run/3 for 1 match, Regex.scan/3 for all matches, or check out other functions in the Regex module.
iex(1)> regex = ~r"\[tag1\](.+?)\[\/tag1\]"
~r/\[tag1\](.+?)\[\/tag1\]/
iex(2)> [_, inner] = Regex.run(regex, "[tag1]bar[/tag1]")
["[tag1]bar[/tag1]", "bar"]
iex(3)> inner
"bar"
iex(4)> Regex.scan(regex, "[tag1]bar[/tag1] [tag1]baz[/tag1]")
[["[tag1]bar[/tag1]", "bar"], ["[tag1]baz[/tag1]", "baz"]]
The docs do a good job at hiding it, but it's there: https://hexdocs.pm/elixir/1.0.5/Regex.html#replace/4
The replacement can be either a string or a function. The string is used as a replacement for every match and it allows specific captures to be accessed via \N or \g{N}, where N is the capture. In case \0 is used, the whole match is inserted.
Note: the \ in \N needs to be escaped so it will be \\N
Use Regex.replace/4 (or String.replace/4 when piping because the string is the first argument) to do this with one command.
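For example, using the \\1 backreference described above to keep the captured text while changing the surrounding tags (the <b> tags here are just an arbitrary replacement chosen for illustration):
iex(1)> Regex.replace(~r"\[tag1\](.+?)\[\/tag1\]", "[tag1]foo[/tag1]", "<b>\\1</b>")
"<b>foo</b>"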
You do not actually need to capture what is in between ] [/ in the first place. You only need to match and replace it. This solution uses a "lookaround", which you can find more information on here: http://www.regular-expressions.info/lookaround.html
iex(1)> String.replace("[tag1]foo[/tag1]", ~r"\w+(?=\[)", "bar")
"[tag1]bar[/tag1]"

Wordnet: how to know if a string is a valid query string

So I'm having trouble calling functions from Wordnet::SenseRelate because some of the "words" in the text are not valid queries. I've tried surrounding the call with try and catch so that the program doesn't quit and skips the word, but no luck. I wanted to check whether a word is valid by using Wordnet::QueryData, but it quits when I use an invalid word like:
$wn->querySense("#44");
I get:
(querySense) Bad query string: #44
The regex which is used can be found in the statement:
my ($word, $pos, $sense) = $string =~ /^([^\#]+)(?:\#([^\#]+)(?:\#(\d+))?)?$/;
If in doubt whether a token will be accepted, test it against this regex.
Commenting on the specific question: there cannot be any leading or trailing # characters (which is the problem experienced here). If # characters are present, there can be one or two, but not more than two, in the query string. When present, the # characters act as delimiters that determine which part is the word, which is the pos and which is the sense.
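A minimal pre-check along those lines might look like this (a sketch in Perl; $token is a hypothetical variable holding the candidate word):
my $token = "#44";
if ($token =~ /^([^\#]+)(?:\#([^\#]+)(?:\#(\d+))?)?$/) {
    print "valid query string: $token\n";
} else {
    print "skipping invalid token: $token\n";    # "#44" lands here, so querySense is never called
}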