convert string to regex pattern - regex

I want to derive a regular-expression pattern from a character string. My goal is to reuse this pattern to find strings with the same shape in another context.
From the string "1example4whatitry2do",
I want to find a pattern like: [0-9]{1}[a-z]{7}[0-9]{1}[a-z]{8}[0-9]{1}[a-z]{2}
so that I can reuse it to match this other example string: 2eytmpxe8wsdtmdry1uo
I can loop over each character, but I hope there is a faster way.
Thanks for your help!

You can puzzle this out:
go over your strings characterwise
if the character is a text character add a 't' to a list
if the character is a number add a 'd' to a list
if the character is something else, add itself to the list
Use itertools.groupby to group consecutive identical letters into groups.
Create a pattern from the group-key and the length of the group using some string literal formatting.
Code:
from itertools import groupby
from string import ascii_lowercase

lower_case = set(ascii_lowercase)  # set for faster lookup

def find_regex(p):
    cum = []
    for c in p:
        if c.isdigit():
            cum.append("d")
        elif c in lower_case:
            cum.append("t")
        else:
            cum.append(c)
    grp = groupby(cum)
    return ''.join(f'\\{what}{{{how_many}}}'
                   if how_many>1 else f'\\{what}'
                   for what,how_many in ( (g[0],len(list(g[1]))) for g in grp))

pattern = "1example4...whatit.ry2do"
print(find_regex(pattern))
Output:
\d\t{7}\d\.{3}\t{6}\.\t{2}\d\t{2}
The conditional expression in the formatting leaves out the redundant {1} quantifier for groups of length one.
See:
str.isdigit()
If you now replace '\t' with '[a-z]', your regex should fit. Instead of the isdigit check you could also use the regex r'\d' or test membership with c in set(string.digits).
pattern = "1example4...whatit.ry2do"
pat = find_regex(pattern).replace(r"\t","[a-z]")
print(pat) # \d[a-z]{7}\d\.{3}[a-z]{6}\.[a-z]{2}\d[a-z]{2}
See
string module for ascii_lowercase and digits
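As a variation on the answer above, the character classes can be emitted directly, which avoids the intermediate '\t' placeholder. This is only a sketch: the names char_class and build_pattern are made up for illustration, and it assumes that only ASCII digits and lowercase letters should be generalized while every other character is matched literally via re.escape.

```python
import re
from itertools import groupby

def char_class(c):
    """Map one character to the regex fragment that should match it."""
    if "0" <= c <= "9":
        return "[0-9]"
    if "a" <= c <= "z":
        return "[a-z]"
    return re.escape(c)  # anything else is matched literally

def build_pattern(sample):
    """Collapse runs of identical fragments into fragment{n}."""
    parts = []
    for fragment, run in groupby(char_class(c) for c in sample):
        n = len(list(run))
        parts.append(fragment if n == 1 else f"{fragment}{{{n}}}")
    return "".join(parts)

pattern = build_pattern("1example4whatitry2do")
print(pattern)
# The derived pattern also matches the second string from the question.
print(bool(re.fullmatch(pattern, "2eytmpxe8wsdtmdry1uo")))
```

Using re.fullmatch checks the whole string against the pattern; use re.search instead if the shape should be found inside a longer text.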

Related

How to get all sub-strings of a specific format from a string

I have a large string and I want to get all sub-strings of format [[someword]] from it.
Meaning, get all words (list) which are wrapped in opening and closing square brackets.
Now one way to do this is splitting the string by space and then filtering the list, but the problem is that sometimes [[someword]] does not exist as a separate word: it may have a comma, a space or a period right before or after it.
What is the best way to do this?
I will appreciate a solution in Scala but as this is more of a programming problem, I will convert your solution to Scala if it's in some other language I know e.g. Python.
This question is different from the marked duplicate because the regex needs to be able to accommodate characters other than English characters in between the brackets.
You can use this (?<=\[{2})[^[\]]+(?=\]{2}) regex to match and extract all the words you need that are contained in double square brackets.
Here is a Python solution,
import re
s = 'some text [[someword]] some [[some other word]]other text '
print(re.findall(r'(?<=\[{2})[^[\]]+(?=\]{2})', s))
Prints,
['someword', 'some other word']
I have never worked in Scala, but here is a solution in Java; since Scala runs on the JVM and can use the same regex classes, this may help.
String s = "some text [[someword]] some [[some other word]]other text ";
Pattern p = Pattern.compile("(?<=\\[{2})[^\\[\\]]+(?=\\]{2})");
Matcher m = p.matcher(s);
while (m.find()) {
    System.out.println(m.group());
}
Prints,
someword
some other word
Let me know if this is what you were looking for.
Scala solution:
val text = "[[someword1]] test [[someword2]] test 1231"
// match words in double brackets and capture the content with a group;
// \p{L} covers letters of any script, \p{N} the digits in "someword1"
val pattern = "\\[\\[([\\p{L}\\p{N}]+)\\]\\]".r
val values = pattern
  .findAllIn(text)
  .matchData
  .map(_.group(1)) // get 1st group
  .toList
println(values)

Regex using condition on match results of another regex

I have a long string S and a string-to-string map M, where keys in M are the results of a regex match on S. I want to do a find-and-replace on S where, whenever one of the matches from that same regex is exactly one of my keys K in M, I replace it with its value M[K].
In order to do this I think I'd need to access the result of regex matches within a regex. If I try to store the result of a match and test equality outside a regex, I can't do my replace because I no longer know where the match was. How do I accomplish my goal?
Examples:
S = "abcd_a", regex = "[a-z]", M = {a:b}
result: "bbcd_b" because the regex would match the a's and replace them with b's
S = "abcd_a", regex = "[a-z]*", M = {a:b}
result: "abcd_b" because the regex would match "abcd" (but not replace it because it is not exactly "a") and the final 'a' (which it would replace because it is exactly "a")
EDIT Thanks for AlanMoore's suggestion. The code is now simpler.
I tried using python (2.7x) to solve this simple example, but it can be achieved with any other language. What's important is the approach (algorithm). Hope it helps:
import re

S = "abcd_a"
REGEX = "[a-z]"
M = {'a':'b'}

def ReplaceWithDict(pattern):
    # split by match group and map the match against map dict
    return ''.join([M[v] if v and v in M else v for v in re.split(pattern, S)])

print(ReplaceWithDict('([a-z])'))
print(ReplaceWithDict('([a-z]*)'))
Output:
bbcd_b
abcd_b
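Another way to get the same result, sketched here for Python 3, is to pass a function as the replacement argument of re.sub: the function receives each match object and decides what to substitute, so the position of the match is never lost. The helper name replace_exact is made up for this example.

```python
import re

def replace_exact(pattern, mapping, text):
    """Replace every regex match that is exactly a key of `mapping`."""
    def swap(match):
        found = match.group(0)
        # Keep the matched text unchanged when it is not a key.
        return mapping.get(found, found)
    return re.sub(pattern, swap, text)

print(replace_exact(r"[a-z]", {"a": "b"}, "abcd_a"))   # bbcd_b
print(replace_exact(r"[a-z]*", {"a": "b"}, "abcd_a"))  # abcd_b
```

With this approach the pattern does not need a capturing group, because match.group(0) is always the whole match.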

Exclude words that contain my regular expression but are not my regular expression

I am trying to find a way of excluding the words that contain my regular expression, but are not my regular expression using the search method of a Text widget object. For example, suppose I have this regular expression "(if)|(def)", and words like define, definition or elif are all found by the re.search function, but I want a regular expression that finds exactly just if and def.
This is the code I am using:
import keyword
PY_KEYS = keyword.kwlist
PY_PATTERN = "^(" + ")|(".join(PY_KEYS) + ")$"
But it still matches words like define, whereas I want only whole words like def, even though define contains def.
I need this to highlight words in a tkinter.Text widget. The function I am using which is responsible for highlight the code is:
def highlight(self, event, pattern='', tag=KW, start=1.0, end="end", regexp=True):
    """Apply the given tag to all text that matches the given pattern
    If 'regexp' is set to True, pattern will be treated as a regular
    expression.
    """
    if not isinstance(pattern, str) or pattern == '':
        pattern = self.syntax_pattern  # PY_PATTERN
    # print(pattern)
    start = self.index(start)
    end = self.index(end)
    self.mark_set("matchStart", start)
    self.mark_set("matchEnd", start)
    self.mark_set("searchLimit", end)
    count = tkinter.IntVar()
    while pattern != '':
        index = self.search(pattern, "matchEnd", "searchLimit",
                            count=count, regexp=regexp)
        # prints nothing
        print(self.search(pattern, "matchEnd", "searchLimit",
                          count=count, regexp=regexp))
        if index == "":
            break
        self.mark_set("matchStart", index)
        self.mark_set("matchEnd", "%s+%sc" % (index, count.get()))
        self.tag_add(tag, "matchStart", "matchEnd")
On the other hand, if PY_PATTERN = "\\b(" + "|".join(PY_KEYS) + ")\\b", then it highlights nothing, and you can see, if you put a print inside the function, that it's an empty string.
You can use anchors:
"^(?:if|def)$"
^ asserts position at the start of the string, and $ asserts position at the end of the string, asserting that nothing more can be matched unless the string is entirely if or def.
>>> import re
>>> for foo in ["if", "elif", "define", "def", "in"]:
...     bar = re.search("^(?:if|def)$", foo)
...     print(foo, ' ', bar)
...
if   <_sre.SRE_Match object at 0x934daa0>
elif   None
define   None
def   <_sre.SRE_Match object at 0x934daa0>
in   None
You could use word boundaries (in Python, write the pattern as a raw string so that \b is not read as a backspace character):
r"\b(if|def)\b"
The answers given are fine for Python's regular expressions, but I have found in the meantime that the search method of a tkinter Text widget actually uses Tcl's regular expression syntax.
In this case, instead of wrapping the word or the regular expression with \b (or \\b if we are not using a raw string), we can simply use the corresponding Tcl word-boundary escape, \y (or \\y), which did the job in my case.
See my other question for more information.
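To make the difference concrete, here is a small sketch that builds both variants of the keyword pattern. Only the Python variant is exercised below; the Tcl variant is just the string one would pass to the Text widget's search method with regexp=True, and it is not run here. The variable names are chosen for this example.

```python
import keyword
import re

alternation = "|".join(keyword.kwlist)

# Python's re module: \b is the word-boundary assertion.
py_pattern = r"\b(?:" + alternation + r")\b"

# Tcl regular expressions (what Text.search uses): \y is the
# word-boundary escape, so the same idea is spelled like this.
tcl_pattern = r"\y(" + alternation + r")\y"

sample = "define def elif if definition"
print(re.findall(py_pattern, sample))  # ['def', 'elif', 'if']
print(tcl_pattern[:12])
```

Raw strings keep the backslashes intact in both cases; without the r prefix each backslash would have to be doubled.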

Python RegEx query missing overlapping substrings

Python3.3, OS X 7.5
I am attempting to locate all instances of a 4-character substring defined as follows:
First character = 'N'
Second character = Anything but 'P'
Third character = 'S' or 'T'
Fourth character = Anything but 'P'
My query looks like this:
re.findall(r"N[A-OQ-Z][ST][A-OQ-Z]", text)
This works except in one particular case where two substrings overlap. That case involves the following 5-character substring:
'...NNTSY...'
The query catches the first 4-character substring ('NNTS'), but not the second 4-character substring ('NTSY').
This is my first attempt at regular expressions, and obviously I'm missing something.
You can do this if the re engine does not consume characters as it matches them, which is possible with lookahead assertions:
import re
text = '...NNTSY...'
for m in re.findall(r'(?=(N[A-OQ-Z][ST][A-OQ-Z]))', text):
    print(m)
Output:
NNTS
NTSY
Having everything within the assertion works but also feels weird. Another way is taking the N out of the assertion:
for m in re.findall(r'(N(?=([A-OQ-Z][ST][A-OQ-Z])))', text):
    print(''.join(m))
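If the positions of the overlapping hits are needed as well, the same lookahead idea works with re.finditer, which yields match objects instead of plain strings. A minimal sketch using the sample text from the question:

```python
import re

text = '...NNTSY...'
# The lookahead consumes nothing, so the scan advances one character at a
# time and every start position is tested; group 1 holds the 4 characters.
pattern = re.compile(r'(?=(N[A-OQ-Z][ST][A-OQ-Z]))')

hits = [(m.start(), m.group(1)) for m in pattern.finditer(text)]
print(hits)  # [(3, 'NNTS'), (4, 'NTSY')]
```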
From the Python 3 documentation (emphasis added):
$ python3 -c 'import re; help(re.findall)'
Help on function findall in module re:
findall(pattern, string, flags=0)
Return a list of all non-overlapping matches in the string.
If one or more capturing groups are present in the pattern, return
a list of groups; this will be a list of tuples if the pattern
has more than one group.
Empty matches are included in the result.
If you want overlapping instances, use regex.search() in a loop. You have to compile the regular expression because the API for non-compiled regular expressions doesn't take a parameter to specify the starting position.
def findall_overlapping(pattern, string, flags=0):
    """Find all matches, even ones that overlap."""
    regex = re.compile(pattern, flags)
    pos = 0
    while True:
        match = regex.search(string, pos)
        if not match:
            break
        yield match
        pos = match.start() + 1
(N[^P](?:S|T)[^P])

String separation in required format, Pythonic way? (with or w/o Regex)

I have a string in the format:
t='#abc #def Hello this part is text'
I want to get this:
l=["abc", "def"]
s='Hello this part is text'
I did this:
a=t[:t.find(' ',t.rfind('#'))].strip()
s=t[t.find(' ',t.rfind('#')):].strip()
b=a.split('#')
l=[i.strip() for i in b][1:]
It works for the most part, but it fails when the text part has the '#'.
Eg, when:
t='#abc #def My email is red#hjk.com'
it fails. The #names are there in the beginning and there can be text after #names, which may possibly contain #.
Clearly I can append initially with a space and find out the first word without '#'. But that doesn't seem an elegant solution.
What is a pythonic way of solving this?
Building unashamedly on MrTopf's effort:
import re
rx = re.compile(r"((?:#\w+ +)+)(.*)")
t='#abc #def #xyz Hello this part is text and my email is foo#ba.r'
a,s = rx.match(t).groups()
l = re.split('[# ]+',a)[1:-1]
print(l)
print(s)
prints:
['abc', 'def', 'xyz']
Hello this part is text and my email is foo#ba.r
Justly called to account by hasen j, let me clarify how this works:
/#\w+ +/
matches a single tag - # followed by at least one alphanumeric or _ followed by at least one space character. + is greedy, so if there is more than one space, it will grab them all.
To match any number of these tags, we need to add a plus (one or more things) to the pattern for tag; so we need to group it with parentheses:
/(#\w+ +)+/
which matches one-or-more tags, and, being greedy, matches all of them. However, those parentheses now fiddle around with our capture groups, so we undo that by making them into an anonymous group:
/(?:#\w+ +)+/
Finally, we make that into a capture group and add another to sweep up the rest:
/((?:#\w+ +)+)(.*)/
A last breakdown to sum up:
((?:#\w+ +)+)(.*)
 (?:#\w+ +)+
 (  #\w+ +)
    #\w+ +
Note that in reviewing this, I've improved it - \w didn't need to be in a set, and it now allows for multiple spaces between tags. Thanks, hasen-j!
t='#abc #def Hello this part is text'
words = t.split(' ')
names = []
while words:
    if not words[0].startswith('#'):
        break  # first non-tag word: leave it in place for the text part
    names.append(words.pop(0)[1:])
text = ' '.join(words)
print(names)
print(text)
How about this:
1. Split by space.
2. For each word, check:
2.1. if the word starts with #, push it to the first list;
2.2. otherwise, just join the remaining words by spaces.
You might also use regular expressions:
import re
rx = re.compile(r"#([\w]+) #([\w]+) (.*)")
t='#abc #def Hello this part is text and my email is foo#ba.r'
a,b,s = rx.match(t).groups()
But this all depends on what your data can look like, so you might need to adjust it. It basically creates groups via () and restricts what is allowed inside each of them.
[i.strip('#') for i in t.split(' ', 2)[:2]] # for a fixed number of #def
a = [i.strip('#') for i in t.split(' ') if i.startswith('#')]
s = ' '.join(i for i in t.split(' ') if not i.startswith('#'))
[edit: this is implementing what was suggested by Osama above]
This will create L based on the # variables from the beginning of the string, and then once a non # var is found, just grab the rest of the string.
t = '#one #two #three some text afterward with # symbols# meow#meow'
words = t.split(' ')  # split into list of words based on spaces
L = []
s = ''
for i in range(len(words)):  # go through each word
    word = words[i]
    if word[0] == '#':  # grab #'s from beginning of string
        L.append(word[1:])
        continue
    s = ' '.join(words[i:])  # put spaces back in
    break  # you can ignore the rest of the words
You can refactor this to be less code, but I'm trying to make what is going on obvious.
Here's just another variation that uses split() and no regexpes:
t='#abc #def My email is red#hjk.com'
tags = []
words = iter(t.split())
# iterate over words until first non-tag word
for w in words:
    if not w.startswith("#"):
        # join this word and all the following
        s = w + " " + (" ".join(words))
        break
    tags.append(w[1:])
else:
    s = "" # handle string with only tags
print(tags, s)
Here's a shorter but perhaps a bit cryptic version that uses a regexp to find the first space followed by a non-# character:
import re
t = '#abc #def My email is red#hjk.com #extra bye'
m = re.search(r"\s([^#].*)$", t)
tags = [tag[1:] for tag in t[:m.start()].split()]
s = m.group(1)
print(tags, s) # ['abc', 'def'] My email is red#hjk.com #extra bye
This doesn't work properly if there are no tags or no text. The format is underspecified. You'll need to provide more test cases to validate.
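One way to cover the "no tags" and "no text" cases is to make the tag prefix optional and let each tag end either at whitespace or at the end of the string. The sketch below is written for Python 3; the function name split_tags is made up, and it assumes a tag is '#' followed by word characters, which the question does not fully specify.

```python
import re

def split_tags(t):
    """Split leading '#tag' words from the remaining text."""
    # Group 1: zero or more '#word' tokens, each followed by whitespace or
    # the end of the string. Group 2: whatever is left over.
    prefix, rest = re.match(r'((?:#\w+(?:\s+|$))*)(.*)', t, re.DOTALL).groups()
    return re.findall(r'#(\w+)', prefix), rest

print(split_tags('#abc #def My email is red#hjk.com'))
print(split_tags('#abc #def'))
print(split_tags('no tags here'))
```

Because both groups can match the empty string, re.match always succeeds here, so there is no None result to guard against.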