I have a long string S and a string-to-string map M, where keys in M are the results of a regex match on S. I want to do a find-and-replace on S where, whenever one of the matches from that same regex is exactly one of my keys K in M, I replace it with its value M[K].
In order to do this I think I'd need to access the result of regex matches within a regex. If I try to store the result of a match and test equality outside a regex, I can't do my replace because I no longer know where the match was. How do I accomplish my goal?
Examples:
S = "abcd_a", regex = "[a-z]", M = {a:b}
result: "bbcd_b" because the regex would match the a's and replace them with b's
S = "abcd_a", regex = "[a-z]*", M = {a:b}
result: "abcd_b" because the regex would match "abcd" (but not replace it because it is not exactly "a") and the final 'a' (which it would replace because it is exactly "a")
EDIT Thanks for AlanMoore's suggestion. The code is now simpler.
I tried using python (2.7x) to solve this simple example, but it can be achieved with any other language. What's important is the approach (algorithm). Hope it helps:
import re
from itertools import cycle
S = "abcd_a"
REGEX = "[a-z]"
M = {'a':'b'}
def ReplaceWithDict(pattern):
    # split by match group and map the match against map dict
    return ''.join([M[v] if v and v in M else v for v in re.split(pattern, S)])
print ReplaceWithDict('([a-z])')
print ReplaceWithDict('([a-z]*)')
Output:
bbcd_b
abcd_b
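The same idea can also be sketched with a replacement callback, assuming Python 3. re.sub accepts a function as the replacement, so each match can be looked up in the map at the position where it was found. The helper name replace_with_map is hypothetical, not part of the original answer.

```python
import re

def replace_with_map(pattern, text, mapping):
    # re.sub calls lookup once per match; returning the matched text
    # unchanged leaves that part of the string as it was.
    def lookup(match):
        found = match.group(0)
        return mapping.get(found, found)
    return re.sub(pattern, lookup, text)

print(replace_with_map("[a-z]", "abcd_a", {"a": "b"}))   # bbcd_b
print(replace_with_map("[a-z]*", "abcd_a", {"a": "b"}))  # abcd_b
```

Because the callback receives the match object, no separate bookkeeping of match positions is needed.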
I want to derive a regular expression pattern from a character string. My goal is to reuse this pattern to find strings with the same shape in another context.
From the string "1example4whatitry2do",
I want to find a pattern like: [0-9]{1}[a-z]{7}[0-9]{1}[a-z]{8}[0-9]{1}[a-z]{2}
so that I can reuse this pattern to match another string such as 2eytmpxe8wsdtmdry1uo.
I can loop over each character, but I hope there is a faster way.
Thanks for your help !
You can puzzle this out:
go over your strings characterwise
if the character is a text character add a 't' to a list
if the character is a number add a 'd' to a list
if the character is something else, add itself to the list
Use itertools.groupby to group consecutive identical letters into groups.
Create a pattern from the group-key and the length of the group using some string literal formatting.
Code:
from itertools import groupby
from string import ascii_lowercase
lower_case = set(ascii_lowercase) # set for faster lookup
def find_regex(p):
    cum = []
    for c in p:
        if c.isdigit():
            cum.append("d")
        elif c in lower_case:
            cum.append("t")
        else:
            cum.append(c)
    grp = groupby(cum)
    return ''.join(f'\\{what}{{{how_many}}}'
                   if how_many>1 else f'\\{what}'
                   for what,how_many in ( (g[0],len(list(g[1]))) for g in grp))
pattern = "1example4...whatit.ry2do"
print(find_regex(pattern))
Output:
\d\t{7}\d\.{3}\t{6}\.\t{2}\d\t{2}
The ternary in the formatting drops the unnecessary {1} from the pattern.
See:
str.isdigit()
If you now replace '\t' with '[a-z]', your regex should fit. You could also replace the isdigit check with a regex r'\d' or with a membership test such as c in set(string.digits).
pattern = "1example4...whatit.ry2do"
pat = find_regex(pattern).replace(r"\t","[a-z]")
print(pat) # \d[a-z]{7}\d\.{3}[a-z]{6}\.[a-z]{2}\d[a-z]{2}
See
string module for ascii_lowercase and digits
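To check the approach against the two strings from the question, here is a minimal Python 3 sketch that emits the character classes directly and then tests the second string with re.fullmatch. The names char_class and derive_pattern are hypothetical, and the sketch assumes only ASCII digits and lowercase ASCII letters need their own class. Every other character is escaped literally.

```python
import re
from itertools import groupby
from string import ascii_lowercase, digits

def char_class(c):
    # Map one character to the regex fragment that should stand in for it.
    if c in digits:
        return "[0-9]"
    if c in ascii_lowercase:
        return "[a-z]"
    return re.escape(c)

def derive_pattern(sample):
    parts = []
    # groupby collapses runs of identical fragments so they can be counted.
    for fragment, run in groupby(char_class(c) for c in sample):
        count = len(list(run))
        parts.append(fragment if count == 1 else f"{fragment}{{{count}}}")
    return "".join(parts)

pattern = derive_pattern("1example4whatitry2do")
print(pattern)  # [0-9][a-z]{7}[0-9][a-z]{8}[0-9][a-z]{2}
print(bool(re.fullmatch(pattern, "2eytmpxe8wsdtmdry1uo")))  # True
```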
I am trying to pull a certain number from various strings. The number has to be standalone, before ', or before (. The regex I came up with was:
\b(?<!\()(x)\b(,|\(|'|$) <- x is the numeric number.
If x is 2, this pulls the following string (almost) fine, except it also pulls 2'abd'. Any advice what I did wrong here?
2(2'Abf',3),212,2'abc',2(1,2'abd',3)
As I understand it, your actual question is how to get this specific number everywhere except inside parentheses.
To do so I suggest using the skip_what_to_avoid|what_i_want pattern like this:
(\((?>[^()\\]++|\\.|(?1))*+\))
|\b(2)(?=\b(?:,|\(|'|$))
The idea here is to completely disregard the overall matches (and the first group, which the recursive pattern uses to capture everything between parentheses: (\((?>[^()\\]++|\\.|(?1))*+\))): that's the trash bin. Instead, we only need to check capture group 2, which, when set, contains the number found outside of parentheses.
Demo
Sample Code:
import regex as re
regex = r"(\((?>[^()\\]++|\\.|(?1))*+\))|\b(2)(?=\b(?:,|\(|'|$))"
test_str = "2(2'Abf',3),212,2'abc',2(1,2'abd',3)"
matches = re.finditer(regex, test_str, re.MULTILINE)
for matchNum, match in enumerate(matches):
    matchNum = matchNum + 1
    if match.groups()[1] is not None:
        print ("Found at {start}-{end}: {group}".format(start = match.start(2), end = match.end(2), group = match.group(2)))
Output:
Found at 0-1: 2
Found at 16-17: 2
Found at 23-24: 2
This solution requires the alternative Python regex package.
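If installing the regex package is not an option, a rough equivalent can be sketched with the standard re module by tracking parenthesis depth by hand instead of using a recursive pattern. This is a different technique from the one above, and it assumes parentheses never appear inside quoted text. The function name standalone_outside_parens is hypothetical.

```python
import re

def standalone_outside_parens(text, number):
    # Each match is either a parenthesis or the number as a standalone word
    # followed by ',', '(', "'" or the end of the string.
    token = re.compile(r"[()]|\b" + re.escape(number) + r"\b(?=[,(']|$)")
    depth = 0
    spans = []
    for m in token.finditer(text):
        if m.group() == "(":
            depth += 1
        elif m.group() == ")":
            depth = max(depth - 1, 0)
        elif depth == 0:
            # Only keep the number when no parenthesis is currently open.
            spans.append(m.span())
    return spans

print(standalone_outside_parens("2(2'Abf',3),212,2'abc',2(1,2'abd',3)", "2"))
# [(0, 1), (16, 17), (23, 24)]
```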
I have a list of 8-letter sequences like this:
['GQPLWLEH', 'TLYSFFPK', 'TYGEIFEK', 'APYWLINK', ...]
How can I use regular expressions to find all the sequences that have the specific letters at specific positions within the sequence? For example, the letters V, I, F, or Y at the 2nd letter in the sequence and the letters M, L, F, Y at the 3rd position in the sequence.
I'm really new to RE, thanks in advance!
You can try using the following regex pattern:
.[VIFY][MLFY].*
This will match any first character, followed by a second and third character using the logic you want.
import re
mylist = ['GQPLWLEH', 'TLYSFFPK', 'TYGEIFEK', 'APYWLINK']
r = re.compile(".[VIFY][MLFY].*")
newlist = filter(r.match, mylist)
print str(newlist)
Demo here:
Rextester
Note: I added the word BILL to your list in the demo to get something which passes the regex match.
\b.[VIFY][MLFY]\w*\b
This may satisfy what you want. You can play with regex online at regex101
Maybe you can avoid using a regexp altogether:
[x for x in mylist if x[1] in 'VIFY' and x[2] in 'MLFY']
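For more positions, the pattern can be generated instead of typed by hand. The following Python 3 sketch assumes 1-based positions and fixed-length sequences. The helper position_pattern and the two extra sequences are made up for illustration, and the allowed letters are assumed to contain no regex metacharacters.

```python
import re

def position_pattern(allowed, length):
    # allowed maps a 1-based position to the letters permitted there;
    # every other position accepts any character.
    return "".join(
        "[" + allowed[i] + "]" if i in allowed else "."
        for i in range(1, length + 1)
    )

pattern = re.compile(position_pattern({2: "VIFY", 3: "MLFY"}, 8))
# 'AVLKKKKK' and 'GIMAAAAA' are made-up sequences added so that something matches.
sequences = ['GQPLWLEH', 'TLYSFFPK', 'TYGEIFEK', 'APYWLINK', 'AVLKKKKK', 'GIMAAAAA']
print([s for s in sequences if pattern.fullmatch(s)])
# ['AVLKKKKK', 'GIMAAAAA']
```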
I have a pattern 'NewTree' and I want to get all strings that don't contain this pattern 'NewTree'. How do I use regex to do the filter?
So if I have 1.BoostKite 2.SetTree 3. ComeNewTreeNow
Then the output should be BoostKite and SetTree.
Any suggestions? I wanted regex that can work anywhere and not use any language specific function.
You can try using a Negative Lookahead if you want to use a regular expression.
^(?!.*NewTree).*$
Live Demo
Alternatively, you can use the alternation operator, placing what you want to exclude on the left (saying, in effect, "throw this away, it's garbage") and what you want to match in a capturing group on the right.
\w*NewTree\w*|([a-zA-Z]+)
Live Demo
In Python:
(Assuming the strings are in a list, since you mentioned an 'array' in the comments above.)
>>> import re
>>> regex = re.compile(r'^(?!.*NewTree).*$')
>>> mylst = ['BoostKite', 'SetTree', 'ComeNewTree', 'NewTree']
>>> matches = [x for x in mylst if regex.match(x)]
>>> matches
['BoostKite', 'SetTree']
If it is just a long string of multiple words and you want to ignore the words that contain NewTree:
>>> s = '1.BoostKite 2.SetTree 3. ComeNewTreeNow 4. foo 5. bar'
>>> filter(None, re.findall(r'\w*NewTree\w*|([a-zA-Z]+)', s))
['BoostKite', 'SetTree', 'foo', 'bar']
You can do this without a regular expression as well.
>>> mylst = ['BoostKite', 'SetTree', 'ComeNewTree', 'NewTree']
>>> matches = [x for x in mylst if "NewTree" not in x]
>>> matches
['BoostKite', 'SetTree']
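The interpreter sessions above appear to be Python 2, where filter returns a list. In Python 3 it returns a lazy iterator, so a script version of both regex approaches might look like the sketch below. The variable names no_newtree and words are illustrative.

```python
import re

# Approach 1: a negative lookahead rejects any string containing NewTree.
no_newtree = re.compile(r'^(?!.*NewTree).*$')
mylst = ['BoostKite', 'SetTree', 'ComeNewTree', 'NewTree']
print([x for x in mylst if no_newtree.match(x)])
# ['BoostKite', 'SetTree']

# Approach 2: the left alternative consumes unwanted words and leaves the
# capture group empty; only non-empty captures are kept.
s = '1.BoostKite 2.SetTree 3. ComeNewTreeNow 4. foo 5. bar'
words = [w for w in re.findall(r'\w*NewTree\w*|([a-zA-Z]+)', s) if w]
print(words)
# ['BoostKite', 'SetTree', 'foo', 'bar']
```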
Match each word with the regex \w+NewTree\b. It matches if the word ends with NewTree.
Use the i modifier for a case-insensitive match (ignores case of [a-zA-Z]).
Use \w* instead of \w+ in the above regex if you want to match the bare word NewTree as well.
If you are looking for words that contain NewTree, try this regex: \w*NewTree\w*\b
I think you can do this in general in the manner of the following example for your specific case:
^(([^N]|N[^e]|Ne[^w]|New[^T]|NewT[^r]|NewTr[^e]|NewTre[^e])+)?(.|..|...|....|.....)?$
So far what I have here is a near miss. It will not match any string that has substring NewTree. But it will not match every string that is free of the substring NewTree. In particular it will not match Nvwxyz.
I'm trying to get re.search to find strings that don't have the letter p in them. My regex code returns everything in the list which is what I don't want. I wrote an alternate solution that gives me the exact results that I want, but I want to see if this can be solved with re.search, but I'll also accept another regex solution. I also tried re.findall and that didn't work, and re.match won't work because it looks for the pattern at the beginning of a string.
import re
someList = ['python', 'ppython', 'ython', 'cython', '.python', '.ythop', 'zython', 'cpython', 'www.python.org', 'xyzthon', 'perl', 'javap', 'c++']
# this returns everything from the source list which is what I DON'T want
pattern = re.compile('[^p]')
result = []
for word in someList:
    if pattern.search(word):
        result.append(word)
print '\n', result
''' ['python', 'ppython', 'ython', 'cython', '.python', '.ythop', 'zython', 'cpython', 'www.python.org', 'xyzthon', 'perl', 'javap', 'c++'] '''
# this non regex solution returns the results I want
cnt = 0; no_p = []
for word in someList:
    for letter in word:
        if letter == 'p':
            cnt += 1
            pass
    if cnt == 0:
        no_p.append(word)
    cnt = 0
print '\n', no_p
''' ['ython', 'cython', 'zython', 'xyzthon', 'c++'] '''
You are almost there. The pattern you are using is looking for at least one letter that is not 'p'. You need a more strict one. Try:
pattern = re.compile('^[^p]*$')
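As a quick check of the anchored pattern, here is a Python 3 sketch using the list from the question. The anchors force every character of the word to be a non-p character, so re.search only succeeds on words without a p. The variable names some_list and no_p are illustrative.

```python
import re

some_list = ['python', 'ppython', 'ython', 'cython', '.python', '.ythop',
             'zython', 'cpython', 'www.python.org', 'xyzthon', 'perl',
             'javap', 'c++']

# ^ and $ anchor the match to the whole word, so [^p]* must cover all of it.
no_p = re.compile(r'^[^p]*$')
print([word for word in some_list if no_p.search(word)])
# ['ython', 'cython', 'zython', 'xyzthon', 'c++']
```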
Your understanding of character-set negation is flawed. The regex [^p] will match any string that has a character other than p in it, which is all of your strings. To "negate" a regex, simply negate the condition in the if statement. So:
import re
someList = ['python', 'ppython', 'ython', 'cython', '.python', '.ythop', 'zython', 'cpython', 'www.python.org', 'xyzthon', 'perl', 'javap', 'c++']
pattern = re.compile('p')
result = []
for word in someList:
    if not pattern.search(word):
        result.append(word)
print result
It is, of course, rather pointless to use a regex to see if a single specific character is in the string. Your second attempt is more apt for this, but it could be coded better:
result = []
for word in someList:
    if 'p' not in word:
        result.append(word)
print result