Comparing strings with regex - regex

I basically want to match strings like: "something", "some,thing", "some,one,thing", but I want to not match expressions like: ',thing', '_thing,' , 'some_thing'.
The pattern I want to match is: A string beginning with only letters and the rest of the body can be a comma, space or letters.
Here's what I did:
import re
x=re.compile('^[a-zA-z][a-zA-z, ]*') #there's space in the 2nd expression here
stri='some_thing'
x.match(stri)
It gives me:
<_sre.SRE_Match object; span=(0, 4), match='some'>
My regex partly works: it extracts the part of the string that matches. What I want instead is to compare the entire string against the pattern and get False if the whole string does not match. How do I do this?

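You use [A-z], which matches more than you think: besides letters, the range A-z also covers the ASCII characters between Z and a ([, \, ], ^, _ and `). The pattern also needs a $ at the end so that the whole string has to match.
If you want to match both upper and lower case, you can use [a-z] with the case-insensitive flag: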
import re
x=re.compile('^[a-z][a-z, ]*$', re.IGNORECASE)
stri='some,thing'
if x.match(stri):
    print ("Match")
else:
    print ("No match")

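The easiest way would be to just compare the matched text to the original string. Note the corrected [a-zA-Z] range, and that x.match returns None when even the first character does not match, so check for None before calling group.
import re
x=re.compile('^[a-zA-Z][a-zA-Z, ]*')
s='some_thing'
x.match(s).group(0) == s #-> False
s = 'some thing'
x.match(s).group(0) == s #-> True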
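Another option, sketched here on the assumption that Python 3.4 or later is available, is re.fullmatch, which only succeeds when the pattern covers the whole string. The helper name is_valid is made up for illustration, and the character class is the corrected [a-zA-Z].

```python
import re

# Assumption: the intended class is [a-zA-Z], not the [a-zA-z] from the question.
PATTERN = re.compile(r'[a-zA-Z][a-zA-Z, ]*')

def is_valid(text):
    """Return True only if the entire string fits the pattern (hypothetical helper)."""
    return PATTERN.fullmatch(text) is not None

for sample in ['something', 'some,thing', 'some,one,thing',
               ',thing', '_thing,', 'some_thing']:
    print(sample, is_valid(sample))
```

Because fullmatch returns None on failure, the function gives a plain True or False and never raises on a non-matching string.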

Related

Exclude words that contain my regular expression but are not my regular expression

I am trying to find a way of excluding the words that contain my regular expression, but are not my regular expression using the search method of a Text widget object. For example, suppose I have this regular expression "(if)|(def)", and words like define, definition or elif are all found by the re.search function, but I want a regular expression that finds exactly just if and def.
This is the code I am using:
import keyword
PY_KEYS = keyword.kwlist
PY_PATTERN = "^(" + ")|(".join(PY_KEYS) + ")$"
But it still matches words like define, and I want only words like def, even though define contains def.
I need this to highlight words in a tkinter.Text widget. The function I am using to highlight the code is:
def highlight(self, event, pattern='', tag=KW, start=1.0, end="end", regexp=True):
    """Apply the given tag to all text that matches the given pattern
    If 'regexp' is set to True, pattern will be treated as a regular
    expression.
    """
    if not isinstance(pattern, str) or pattern == '':
        pattern = self.syntax_pattern # PY_PATTERN
    # print(pattern)
    start = self.index(start)
    end = self.index(end)
    self.mark_set("matchStart", start)
    self.mark_set("matchEnd", start)
    self.mark_set("searchLimit", end)
    count = tkinter.IntVar()
    while pattern != '':
        index = self.search(pattern, "matchEnd", "searchLimit",
                            count=count, regexp=regexp)
        # prints nothing
        print(self.search(pattern, "matchEnd", "searchLimit",
                          count=count, regexp=regexp))
        if index == "":
            break
        self.mark_set("matchStart", index)
        self.mark_set("matchEnd", "%s+%sc" % (index, count.get()))
        self.tag_add(tag, "matchStart", "matchEnd")
On the other hand, if PY_PATTERN = "\\b(" + "|".join(PY_KEYS) + ")\\b", then it highlights nothing, and you can see, if you put a print inside the function, that it's an empty string.
You can use anchors:
"^(?:if|def)$"
^ asserts position at the start of the string, and $ asserts position at the end of the string, asserting that nothing more can be matched unless the string is entirely if or def.
>>> import re
>>> for foo in ["if", "elif", "define", "def", "in"]:
...     bar = re.search("^(?:if|def)$", foo)
...     print(foo, ' ', bar)
...
if <_sre.SRE_Match object at 0x934daa0>
elif None
define None
def <_sre.SRE_Match object at 0x934daa0>
in None
You could use word boundaries:
"\b(if|def)\b"
The answers given are fine for Python's regular expressions, but I have found in the meantime that the search method of a tkinter Text widget actually uses Tcl's regular expression syntax.
In this case, instead of wrapping the word or the regular expression with \b (or \\b if we are not using a raw string), we can simply use the corresponding Tcl word-boundary escape, that is \y (or \\y), which did the job in my case.
See my other question for more information.
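As a rough sketch of the point above, the keyword pattern can be built once with the boundary escape passed in, so the same alternation serves both engines. The helper name keyword_pattern is hypothetical. The \y version is only meant to be handed to Text.search with regexp=True and is not exercised here, since that needs a running Tk widget; only the \b version is checked with Python's re module.

```python
import keyword
import re

def keyword_pattern(boundary):
    """Join all Python keywords into one alternation wrapped in a boundary escape.

    Hypothetical helper: pass r"\y" for Tcl (tkinter Text.search) or r"\b" for re.
    """
    return boundary + "(" + "|".join(keyword.kwlist) + ")" + boundary

TCL_PATTERN = keyword_pattern(r"\y")  # assumption: used as self.search(TCL_PATTERN, ..., regexp=True)
PY_PATTERN = keyword_pattern(r"\b")   # usable with Python's re module

# Only whole keywords are found; 'define' and 'definition' are skipped.
print(re.findall(PY_PATTERN, "def define definition if"))
```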

regular expression to contain all strings that don't contain a pattern

I have a pattern 'NewTree' and I want to get all strings that don't contain this pattern 'NewTree'. How do I use regex to do the filter?
So if I have 1.BoostKite 2.SetTree 3. ComeNewTreeNow
Then the output should be BoostKite and SetTree.
Any suggestions? I wanted regex that can work anywhere and not use any language specific function.
You can try using a Negative Lookahead if you want to use a regular expression.
^(?!.*NewTree).*$
Alternatively, you can use alternation: put what you want to exclude on the left side (that part is matched and thrown away) and put what you want to keep in a capturing group on the right side.
\w*NewTree\w*|([a-zA-Z]+)
In Python:
( The strings being in list context, as you commented 'array' above )
>>> import re
>>> regex = re.compile(r'^(?!.*NewTree).*$')
>>> mylst = ['BoostKite', 'SetTree', 'ComeNewTree', 'NewTree']
>>> matches = [x for x in mylst if regex.match(x)]
>>> matches
['BoostKite', 'SetTree']
If it is just one long string of multiple words and you want to ignore the words that contain NewTree:
>>> s = '1.BoostKite 2.SetTree 3. ComeNewTreeNow 4. foo 5. bar'
>>> list(filter(None, re.findall(r'\w*NewTree\w*|([a-zA-Z]+)', s)))
['BoostKite', 'SetTree', 'foo', 'bar']
You can do this without a regular expression as well.
>>> mylst = ['BoostKite', 'SetTree', 'ComeNewTree', 'NewTree']
>>> matches = [x for x in mylst if "NewTree" not in x]
>>> matches
['BoostKite', 'SetTree']
Match each word against the regex \w+NewTree\b. It matches if the word ends with NewTree, so keep the words that do not match.
Use the i modifier for a case-insensitive match (ignores the case of [a-zA-Z]).
Use \w* instead of \w+ in the above regex if you also want to match the word NewTree itself.
If you are looking for words that contain NewTree anywhere, try the regex \w*NewTree\w*\b
I think you can do this in general in the manner of the following example for your specific case:
^(([^N]|N[^e]|Ne[^w]|New[^T]|NewT[^r]|NewTr[^e]|NewTre[^e])+)?(.|..|...|....|.....)?$
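So far what I have here is only a near miss. It rejects NewTree itself, but it is not exact in either direction. It fails to match some strings that are free of the substring NewTree, for example NewTre. It also still matches some strings that contain NewTree, for example NNewTree, where N[^e] consumes NN and the remaining characters contain no N.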

negating with re.search (find strings that don't contain a specific character)

I'm trying to get re.search to find strings that don't have the letter p in them. My regex code returns everything in the list which is what I don't want. I wrote an alternate solution that gives me the exact results that I want, but I want to see if this can be solved with re.search, but I'll also accept another regex solution. I also tried re.findall and that didn't work, and re.match won't work because it looks for the pattern at the beginning of a string.
import re
someList = ['python', 'ppython', 'ython', 'cython', '.python', '.ythop', 'zython', 'cpython', 'www.python.org', 'xyzthon', 'perl', 'javap', 'c++']
# this returns everything from the source list which is what I DON'T want
pattern = re.compile('[^p]')
result = []
for word in someList:
    if pattern.search(word):
        result.append(word)
print('\n', result)
''' ['python', 'ppython', 'ython', 'cython', '.python', '.ythop', 'zython', 'cpython', 'www.python.org', 'xyzthon', 'perl', 'javap', 'c++'] '''
# this non regex solution returns the results I want
cnt = 0; no_p = []
for word in someList:
    for letter in word:
        if letter == 'p':
            cnt += 1
        pass
    if cnt == 0:
        no_p.append(word)
    cnt = 0
print('\n', no_p)
''' ['ython', 'cython', 'zython', 'xyzthon', 'c++'] '''
You are almost there. The pattern you are using is looking for at least one letter that is not 'p'. You need a more strict one. Try:
pattern = re.compile('^[^p]*$')
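A short sketch of how that anchored pattern behaves on the list from the question. The names some_list and no_p are chosen here for illustration; the expected result is the one the question reports for its non-regex loop.

```python
import re

some_list = ['python', 'ppython', 'ython', 'cython', '.python', '.ythop',
             'zython', 'cpython', 'www.python.org', 'xyzthon', 'perl',
             'javap', 'c++']

# ^ and $ force every character of the word to be something other than 'p'.
no_p = re.compile(r'^[^p]*$')

result = [word for word in some_list if no_p.search(word)]
print(result)  # ['ython', 'cython', 'zython', 'xyzthon', 'c++']
```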
Your understanding of character-set negation is flawed. The regex [^p] will match any string that has a character other than p in it, which is all of your strings. To "negate" a regex, simply negate the condition in the if statement. So:
import re
someList = ['python', 'ppython', 'ython', 'cython', '.python', '.ythop', 'zython', 'cpython', 'www.python.org', 'xyzthon', 'perl', 'javap', 'c++']
pattern = re.compile('p')
result = []
for word in someList:
    if not pattern.search(word):
        result.append(word)
print(result)
It is, of course, rather pointless to use a regex to see if a single specific character is in the string. Your second attempt is more apt for this, but it could be coded better:
result = []
for word in someList:
    if 'p' not in word:
        result.append(word)
print(result)

RegEx pattern returning all words except those in parenthesis

I have a text of the form:
können {konnte, gekonnt} Verb
And I want to get a match for all words in it that are not inside the braces. That means:
können = 1st match, Verb = 2nd match
Unfortunately I still don't have the knack of regular expressions. There are plenty of tools for testing them, but not much help for creating them unless you want to read a book.
I will use them in Java or Python.
In Python you could do this:
import re
regex = re.compile(r'(?:\{.*?\})?([^{}]+)', re.UNICODE)
print 'Matches: %r' % regex.findall(u'können {konnte, gekonnt} Verb')
Result:
Matches: [u'können ', u' Verb']
Although I would recommend simply replacing everything between { and } like so:
import re
regex = re.compile(r'\{.*?\}', re.UNICODE)
print 'Output string: %r' % regex.sub('', u'können {konnte, gekonnt} Verb')
Result:
Output string: u'können Verb'
A regex SPLIT using this pattern will do the job:
(\s+|\s*{[^}]*\}\s*)
and ignore any empty value.
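In Python, the split approach could look like the sketch below. One assumption differs from the pattern as written: the group is made non-capturing, because re.split would otherwise return the separators as well. The helper name words_outside_braces is made up for illustration.

```python
import re

# Same alternatives as the split pattern above, but in a non-capturing group
# so that re.split does not include the separators in its result.
SEPARATOR = re.compile(r'(?:\s+|\s*\{[^}]*\}\s*)')

def words_outside_braces(text):
    """Split on whitespace and on {...} blocks, dropping empty pieces."""
    return [piece for piece in SEPARATOR.split(text) if piece]

print(words_outside_braces('können {konnte, gekonnt} Verb'))
```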

How to match regex with same format but different in terms of character set?

Suppose I have a string and I want to match only the parameters whose value is empty, not the ones that have a value.
For example: &lang=&val=1233
I need only &lang and not &val, since &val has an actual value.
I have this regex:
&(.+)=(?!\s\S)
which matches &lang=&val= in the string.
Can anyone help me out?
Use the following regular expression:
(?:(?<=\?)|&)[^=]+=(?=&|$)
It can be explained as:
(?: ....): non-capturing (does not create a group); this may not be needed, depending on your purpose.
\?: escaped ? to match ? literally.
(?<=\?): means "preceded by ?"; the ? is not included in the result.
(?=&|$): means "followed by &" or "at the end of the input".
The following are sample tests in the Python interactive shell:
>>> import re
>>> pattern = r'(?:(?<=\?)|&)[^=]+=(?=&|$)'
>>> re.findall(pattern, '&lang=&val=')
['&lang=', '&val=']
>>> re.findall(pattern, '&lang=&val=1233')
['&lang=']
>>> re.findall(pattern, '&lang=&val=&val2=123&val3=')
['&lang=', '&val=', '&val3=']
>>> re.findall(pattern, '?lang=&val=&val2=123&val3=')
['lang=', '&val=', '&val3=']
>>> re.findall(pattern, '?lang=blah&val=&val2=123&val3=')
['&val=', '&val3=']
>>> re.findall(pattern, 'www.html.com?user=&lang=eng&code=.in')
Do you mean
(&|\?)([^&=]+)=(&|$)
(you can use non-capturing groups if you need)
But I would just build a hash of all query string parameters and pick the keys without values. It is cheaper.
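A minimal sketch of that idea in Python, assuming the standard library's urllib.parse.parse_qsl is acceptable for the input. The helper name empty_params is hypothetical. keep_blank_values=True is needed because blank values are dropped by default.

```python
from urllib.parse import parse_qsl

def empty_params(query):
    """Return the names of query parameters whose value is empty (hypothetical helper)."""
    # Strip a leading '?' or '&' so the string is a bare list of name=value pairs.
    pairs = parse_qsl(query.lstrip('?&'), keep_blank_values=True)
    return [name for name, value in pairs if value == '']

print(empty_params('&lang=&val=1233'))
```

This returns a list rather than a dictionary so that repeated parameter names are not silently merged.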
Try this:
[?&]([^&]+)=(&|$)
The first group will have the name of your parameter.
Note that this regex will also catch an empty first parameter (val1 in foo.php?val1=&val2=ok)
Try this one:
(&([^=]+))=(?=&)
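The regex answers above can be combined in one sketch: a character class for the leading ? or &, a capturing group for the name, and a lookahead for the terminator. The lookahead matters because a pattern that consumes the trailing & will skip the next parameter when two empty ones are adjacent. EMPTY_PARAM and empty_param_names are names invented for this example.

```python
import re

# [?&]      the parameter must follow '?' or '&'
# ([^&=]+)  the parameter name, captured
# =(?=&|$)  '=' followed by '&' or end of input, without consuming the '&'
EMPTY_PARAM = re.compile(r'[?&]([^&=]+)=(?=&|$)')

def empty_param_names(text):
    """Return names of parameters that have no value (hypothetical helper)."""
    return EMPTY_PARAM.findall(text)

print(empty_param_names('&lang=&val=1233'))
print(empty_param_names('?lang=&val=&val2=123&val3='))
```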