regular expression to contain all strings that don't contain a pattern - regex

I have a pattern 'NewTree' and I want to get all strings that don't contain this pattern 'NewTree'. How do I use regex to do the filter?
So if I have 1.BoostKite 2.SetTree 3. ComeNewTreeNow
Then the output should be BoostKite and SetTree.
Any suggestions? I wanted regex that can work anywhere and not use any language specific function.

You can try using a Negative Lookahead if you want to use a regular expression.
^(?!.*NewTree).*$
Alternatively, you can use the alternation operator in context: place what you want to exclude on the left (saying "throw this away, it's garbage") and what you want to match in a capturing group on the right.
\w*NewTree\w*|([a-zA-Z]+)
In Python:
(The strings are in a list, since you mentioned an 'array' in your comment above.)
>>> import re
>>> regex = re.compile(r'^(?!.*NewTree).*$')
>>> mylst = ['BoostKite', 'SetTree', 'ComeNewTree', 'NewTree']
>>> matches = [x for x in mylst if regex.match(x)]
>>> matches
['BoostKite', 'SetTree']
If it is just a long string of multiple words and you want to ignore the words that contain NewTree:
>>> s = '1.BoostKite 2.SetTree 3. ComeNewTreeNow 4. foo 5. bar'
>>> list(filter(None, re.findall(r'\w*NewTree\w*|([a-zA-Z]+)', s)))
['BoostKite', 'SetTree', 'foo', 'bar']
You can do this without a regular expression as well.
>>> mylst = ['BoostKite', 'SetTree', 'ComeNewTree', 'NewTree']
>>> matches = [x for x in mylst if "NewTree" not in x]
>>> matches
['BoostKite', 'SetTree']
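As a minimal sketch, the negative-lookahead filter above can be wrapped in a helper that works for any literal substring. The function name exclude_containing and its parameters are hypothetical, chosen only for illustration; re.escape is used on the assumption that the excluded text should be treated literally rather than as a regex.

```python
import re

def exclude_containing(strings, needle):
    """Return the strings that do not contain `needle` anywhere."""
    # re.escape makes the needle literal, so "a.b" only excludes "a.b".
    # re.DOTALL lets `.` cross newlines in multi-line strings.
    regex = re.compile(r"^(?!.*" + re.escape(needle) + r").*$", re.DOTALL)
    return [s for s in strings if regex.match(s)]

names = ["BoostKite", "SetTree", "ComeNewTreeNow"]
print(exclude_containing(names, "NewTree"))
```

The same regex string, without the Python wrapper, should carry over to other engines that support lookahead.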

Match each word with the regex \w+NewTree\b. It matches if the word ends with NewTree.
Use the i modifier for a case-insensitive match (ignores the case of [a-zA-Z]).
Use \w* instead of \w+ in the regex above if you also want to match the bare word NewTree.
If you are looking for words that contain NewTree, try the regex \w*NewTree\w*\b
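A short sketch of how these two patterns behave on single words, assuming Python's re module. The sample word list and the variable names are made up for illustration.

```python
import re

words = ["BoostKite", "SetTree", "ComeNewTreeNow", "ComeNewTree", "NewTree"]

ends_with = re.compile(r"\w+NewTree\b")    # something, then NewTree at the end
contains = re.compile(r"\w*NewTree\w*\b")  # NewTree anywhere in the word

ending = [w for w in words if ends_with.search(w)]
containing = [w for w in words if contains.search(w)]
# The original question wants the complement: words WITHOUT NewTree.
without = [w for w in words if not contains.search(w)]

print(ending)
print(containing)
print(without)
```

Note that these patterns find the words that do contain NewTree, so the filtering the question asks for comes from negating the test in the calling code.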

I think you can do this in general in the manner of the following example for your specific case:
^(([^N]|N[^e]|Ne[^w]|New[^T]|NewT[^r]|NewTr[^e]|NewTre[^e])+)?(.|..|...|....|.....)?$
So far what I have here is a near miss. It will not match any string that has substring NewTree. But it will not match every string that is free of the substring NewTree. In particular it will not match Nvwxyz.

Related

Comparing strings with regex

I basically want to match strings like: "something", "some,thing", "some,one,thing", but I want to not match expressions like: ',thing', '_thing,' , 'some_thing'.
The pattern I want to match is: A string beginning with only letters and the rest of the body can be a comma, space or letters.
Here's what I did:
import re
x=re.compile('^[a-zA-z][a-zA-z, ]*') #there's space in the 2nd expression here
stri='some_thing'
x.match(stri)
It gives me:
<_sre.SRE_Match object; span=(0, 4), match='some'>
The thing is, my regex somehow works but, it actually extracts the parts of the string that do match, but I want to compare the entire string with the regular expression pattern and return False if it does not match the pattern. How do I do this?
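You use [A-z], which matches more than you think: in ASCII that range also covers the characters [, \, ], ^, _ and ` that sit between Z and a.
If you want to match [a-zA-Z] in both places, you might use the case-insensitive flag: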
import re
x = re.compile('^[a-z][a-z, ]*$', re.IGNORECASE)
stri = 'some,thing'
if x.match(stri):
    print("Match")
else:
    print("No match")
The easiest way would be to just compare the result to the original string (note the corrected [a-zA-Z] range; [A-z] would also accept the underscore).
import re
x = re.compile('^[a-zA-Z][a-zA-Z, ]*')
s = 'some_thing'
x.match(s).group(0) == s  #-> False
s = 'some thing'
x.match(s).group(0) == s  #-> True
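Another option, sketched here on the assumption that Python 3.4 or later is available, is re.fullmatch, which only succeeds when the whole string fits the pattern. The helper name is_valid is hypothetical.

```python
import re

# One letter first, then any mix of letters, commas and spaces.
pattern = re.compile(r"[a-zA-Z][a-zA-Z, ]*")

def is_valid(text):
    """True only if the entire string matches the pattern."""
    return pattern.fullmatch(text) is not None

for candidate in ["something", "some,thing", ",thing", "some_thing"]:
    print(candidate, is_valid(candidate))
```

Unlike comparing group(0) to the input, this also returns False cleanly when nothing matches at all, instead of raising on a None result.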

Regex using condition on match results of another regex

I have a long string S and a string-to-string map M, where keys in M are the results of a regex match on S. I want to do a find-and-replace on S where, whenever one of the matches from that same regex is exactly one of my keys K in M, I replace it with its value M[K].
In order to do this I think I'd need to access the result of regex matches within a regex. If I try to store the result of a match and test equality outside a regex, I can't do my replace because I no longer know where the match was. How do I accomplish my goal?
Examples:
S = "abcd_a", regex = "[a-z]", M = {a:b}
result: "bbcd_b" because the regex would match the a's and replace them with b's
S = "abcd_a", regex = "[a-z]*", M = {a:b}
result: "abcd_b" because the regex would match "abcd" (but not replace it because it is not exactly "a") and the final 'a' (which it would replace because it is exactly "a")
EDIT Thanks for AlanMoore's suggestion. The code is now simpler.
I tried using Python (2.7.x) to solve this simple example, but it can be achieved in any other language. What's important is the approach (algorithm). Hope it helps:
import re

S = "abcd_a"
M = {'a': 'b'}

def ReplaceWithDict(pattern):
    # Split on the capturing group so the matches stay in the result list,
    # then swap every piece that is exactly a key of M for its value.
    return ''.join([M[v] if v and v in M else v for v in re.split(pattern, S)])

print(ReplaceWithDict('([a-z])'))
print(ReplaceWithDict('([a-z]*)'))
Output:
bbcd_b
abcd_b
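A possibly more direct route, sketched for Python 3, is to pass a function as the replacement argument of re.sub. The function receives each match object, so the lookup in M happens at the position of the match. The name replace_with_dict and its parameters are hypothetical.

```python
import re

def replace_with_dict(pattern, text, mapping):
    """Replace each match that is exactly a key of `mapping` by its value."""
    def substitute(match):
        found = match.group(0)
        # Leave the match untouched when it is not a key of the mapping.
        return mapping.get(found, found)
    return re.sub(pattern, substitute, text)

S = "abcd_a"
M = {"a": "b"}
print(replace_with_dict(r"[a-z]", S, M))
print(replace_with_dict(r"[a-z]*", S, M))
```

This avoids the need for a capturing group in the pattern, since re.sub already tracks where each match starts and ends.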

Regex to catch a string without () in 3 patterns like abc(ef) ,(ef)abc and (ef)abc(gh)

I have tested this Regex
(?<=\))(.+?)(?=\()|(?<=\))(.+?)\b|(.+?)(?=\()
but it doesn't work for strings like this pattern (ef)abc(gh).
I got a result like this "(ef)abc".
But these 3 regexes (?<=\))(.+?)(?=\() , (?<=\))(.+?)\b, (.+?)(?=\()
do work separately for "(ef)abc(gh)", "(ef)abc" ,"abc(ef)" .
Can anyone tell me where the problem is, or how I can get the expected result?
Assuming you are looking to match the text between the elements in parentheses, try this:
^(?:\(\w*\))?([\w]*)(?:\(\w*\))?$
^ - beginning of string
(?:\(\w*\))? - non-capturing group, matches 0 or more word characters (letters, digits, underscore) within parentheses; the whole group is optional
([\w]*) - capturing group, matches 0 or more word characters
(?:\(\w*\))? - non-capturing group, matches 0 or more word characters within parentheses; the whole group is optional
$ - end of string
You haven't specified what language you might be using, but here is an example in Python:
>>> import re
>>> string = "(ef)abc(gh)"
>>> string2 = "(ef)abc"
>>> string3 = "abc(gh)"
>>> p = re.compile(r'^(?:\(\w*\))?([\w]*)(?:\(\w*\))?$')
>>> m = re.search(p, string)
>>> m2 = re.search(p, string2)
>>> m3 = re.search(p, string3)
>>> m.group(1)
'abc'
>>> m2.group(1)
'abc'
>>> m3.group(1)
'abc'
\([^)]+\)|([^()\n]+)
Try this. Just grab the capture group. See the demo:
https://regex101.com/r/tX2bH4/6
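A small Python sketch of the "match what you don't want, capture what you do" pattern above. The helper name outside_parens is hypothetical; the regex is the one from this answer.

```python
import re

# Left alternative swallows a parenthesised chunk (nothing captured);
# right alternative captures a run of text outside parentheses.
PATTERN = re.compile(r"\([^)]+\)|([^()\n]+)")

def outside_parens(text):
    """Return the pieces of `text` that are not inside parentheses."""
    # findall yields '' for matches of the left alternative; drop those.
    return [piece for piece in PATTERN.findall(text) if piece]

for sample in ["(ef)abc(gh)", "(ef)abc", "abc(ef)"]:
    print(sample, outside_parens(sample))
```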
Your problem is that (.+?)(?=\() matches "(ef)abc" in "(ef)abc(gh)".
The easiest solution to this problem is to be more explicit about what you are looking for, in this case by replacing "any character" (.) with "any character that is not a parenthesis" ([^\(\)]).
(?<=\))([^\(\)]+?)(?=\()|(?<=\))([^\(\)]+?)\b|([^\(\)]+?)(?=\()
A cleaner regexp would be
(?:(?<=^)|(?<=\)))([^\(\)]+)(?:(?=\()|(?=$))

Python RegEx query missing overlapping substrings

Python3.3, OS X 7.5
I am attempting to locate all instances of a 4-character substring defined as follows:
First character = 'N'
Second character = Anything but 'P'
Third character = 'S' or 'T'
Fourth character = Anything but 'P'
My query looks like this:
re.findall(r"N[A-OQ-Z][ST][A-OQ-Z]", text)
This works except in one particular case where two substrings overlap. That case involves the following 5-character substring:
'...NNTSY...'
The query catches the first 4-character substring ('NNTS'), but not the second 4-character substring ('NTSY').
This is my first attempt at regular expressions, and obviously I'm missing something.
You can do this if the re engine does not consume characters as it matches them, which is possible with lookahead assertions:
import re
text = '...NNTSY...'
for m in re.findall(r'(?=(N[A-OQ-Z][ST][A-OQ-Z]))', text):
    print(m)
Output:
NNTS
NTSY
Having everything within the assertion works but also feels weird. Another way is taking the N out of the assertion:
for m in re.findall(r'(N(?=([A-OQ-Z][ST][A-OQ-Z])))', text):
    print(''.join(m))
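If the positions of the overlapping hits are also needed, re.finditer can be combined with the same lookahead trick. This is only a sketch: the helper name overlapping is hypothetical, and [^P] is used as a looser stand-in for "anything but P" (it also accepts non-letters).

```python
import re

# The lookahead consumes nothing, so the scan advances one character
# at a time and the capture group holds each 4-character candidate.
PATTERN = re.compile(r"(?=(N[^P][ST][^P]))")

def overlapping(text):
    """Return (start_index, substring) for every hit, overlaps included."""
    return [(m.start(), m.group(1)) for m in PATTERN.finditer(text)]

print(overlapping("...NNTSY..."))
```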
From the Python 3 documentation (emphasis added):
$ python3 -c 'import re; help(re.findall)'
Help on function findall in module re:

findall(pattern, string, flags=0)
    Return a list of all non-overlapping matches in the string.

    If one or more capturing groups are present in the pattern, return
    a list of groups; this will be a list of tuples if the pattern
    has more than one group.

    Empty matches are included in the result.
If you want overlapping instances, use regex.search() in a loop. You have to compile the regular expression because the API for non-compiled regular expressions doesn't take a parameter to specify the starting position.
import re

def findall_overlapping(pattern, string, flags=0):
    """Find all matches, even ones that overlap."""
    regex = re.compile(pattern, flags)
    pos = 0
    while True:
        match = regex.search(string, pos)
        if not match:
            break
        yield match
        # Resume one character past the start of this match so that
        # overlapping matches are not skipped.
        pos = match.start() + 1
(N[^P](?:S|T)[^P])

How to match regex with same format but different in terms of character set?

Suppose I have a string and I want to match only the part where the value is empty, not the part where a value is present.
For example: &lang=&val=1233
I need only &lang and not &val, as it has an actual value.
I have this
&(.+)=(?!\s\S)
regex which matches &lang=&val= in the string.
Can anyone help me out?
Use the following regular expression:
(?:(?<=\?)|&)[^=]+=(?=&|$)
It can be explained as follows:
(?: ....): non-capturing (does not make a group); this may not be needed, depending on your purpose.
\?: escaped ? to match ? literally.
(?<=\?): meaning "preceded by ?"; the ? is not included in the result.
(?=&|$): meaning "followed by &" or "at the end of the input".
The following are sample tests in the Python interactive shell:
>>> import re
>>> pattern = r'(?:(?<=\?)|&)[^=]+=(?=&|$)'
>>> re.findall(pattern, '&lang=&val=')
['&lang=', '&val=']
>>> re.findall(pattern, '&lang=&val=1233')
['&lang=']
>>> re.findall(pattern, '&lang=&val=&val2=123&val3=')
['&lang=', '&val=', '&val3=']
>>> re.findall(pattern, '?lang=&val=&val2=123&val3=')
['lang=', '&val=', '&val3=']
>>> re.findall(pattern, '?lang=blah&val=&val2=123&val3=')
['&val=', '&val3=']
>>> re.findall(pattern, 'www.html.com?user=&lang=eng&code=.in')
['user=']
Do you mean
(&|\?)([^&=]+)=(&|$)
(you can use non-capturing groups if you need to)?
But I would just build a hash of all query string parameters and pick the keys without values. It is cheaper.
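The "build a hash and pick the keys without values" idea could look roughly like this in Python, using urllib.parse.parse_qsl from the standard library. The helper name empty_params is hypothetical, and the sketch assumes the input is a plain query string with an optional leading ? or &.

```python
from urllib.parse import parse_qsl

def empty_params(query):
    """Return the names of query parameters whose value is empty."""
    # keep_blank_values=True is required; otherwise empty values are dropped.
    pairs = parse_qsl(query.lstrip("?&"), keep_blank_values=True)
    return [name for name, value in pairs if value == ""]

print(empty_params("&lang=&val=1233"))
print(empty_params("?lang=&val=&val2=123&val3="))
```

Compared with the regex approaches, this returns parameter names without the surrounding & and = characters.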
Try this:
[?&]([^&]+)=(&|$)
The first group will have the name of your parameter.
Note that this regex will also catch an empty first parameter (val1 in foo.php?val1=&val2=ok)
Try this one:
(&([^=]+))=(?=&)