using input() with regular expressions in python - regex

Is it possible to use input() with a regex?
I've written something like this:
import re
words = ['cats', 'cates', 'dog', 'ship']
for l in words:
    m = re.search(r'cat..', l)
    if m:
        print(l)
    else:
        print('none')
This prints 'cates' (and 'none' for each of the other words).
But now I want to use my own input() in place of the literal text in m = re.search(r'cat..', l),
something like:
import re
words = ['cats', 'cates', 'dog', 'ship']
target = input()
for l in words:
    m = re.search(r'target..', l)
    if m:
        print(l)
    else:
        print('none')
This doesn't work, of course (I know it searches for the literal word 'target' and not for the value from input()).
Is there a way to do this, or are regular expressions not the right solution for my problem?

You could construct the regex dynamically:
# Python 2: raw_input() avoids the automatic eval() that input() performs. In Python 3, use input().
target = raw_input()
# re.escape() escapes any special regex characters in the user's text.
rx = re.compile(re.escape(target) + '..')
for l in words:
    m = rx.search(l)
    ....
But it is also possible without a regex:
target = raw_input()
for l in words:
    if l[:-2] == target:
        print(l)
    else:
        print('none')
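For Python 3 readers, the same idea can be sketched as follows. This is a minimal illustration rather than part of the original answer: the helper name find_matches is hypothetical, and a fixed string stands in for the value that input() would return.

```python
import re

def find_matches(target, words):
    """Return the words that contain `target` followed by any two characters."""
    # re.escape() makes characters such as '.' or '*' in the user's text literal.
    rx = re.compile(re.escape(target) + '..')
    return [w for w in words if rx.search(w)]

words = ['cats', 'cates', 'dog', 'ship']
target = 'cat'  # in an interactive script this would be: target = input()
print(find_matches(target, words))  # ['cates']
```

Because the user's text is escaped, a target such as a.b only matches a literal dot and is not treated as "a, any character, b".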

Related

When using pandas is it possible to replace the re package with the regex package? [duplicate]

I am trying to check for fuzzy matches between a string column and a reference list. The string series contains over 1 million rows and the reference list contains over 10,000 entries.
For example:
df['NAMES'] = pd.Series(['ALEXANDERS', 'NOVA XANDER', 'SALA MANDER', 'PARIS HILTON', 'THE HARIS DOWNTOWN', 'APARISIAN', 'PARIS', 'MARIN XO']) # 1mil rows
ref_df['REF_NAMES'] = pd.Series(['XANDER','PARIS']) #10 k rows
### Output should look like
df['MATCH'] = pd.Series([np.nan, 'XANDER', 'MANDER', 'PARIS', 'HARIS', np.nan, 'PARIS', np.nan])
A match should be generated if the word appears as a separate word in the string, with up to 1 character substitution allowed.
For example, 'PARIS' can match against 'PARIS HILTON' and 'THE HARIS DOWNTOWN', but not against 'APARISIAN'.
Similarly, 'XANDER' matches against 'NOVA XANDER' and 'SALA MANDER' (MANDER being 1 character different from XANDER), but does not generate a match against 'ALEXANDERS'.
So far we have written the logic shown below, but the match takes over 4 hours to run. We need to get this to under 30 minutes.
Current code:
import regex
tags_regex = ref_df['REF_NAMES'].tolist()
# Raw f-string so that \s reaches the regex engine unchanged.
tags_ptn_regex = '|'.join([rf'\s+{tag}\s+|^{tag}\s+|\s+{tag}$' for tag in tags_regex])
def search_it(partyname):
    # {s<=1:[A-Z]} allows at most one substitution, drawn from A-Z.
    m = regex.search("(" + tags_ptn_regex + ")" + "{s<=1:[A-Z]}", partyname)
    if m is not None:
        return m.group()
    else:
        return None
df['MATCH'] = df['NAMES'].apply(search_it)
Also, will multiprocessing help with regex? Many thanks in advance!
Your pattern is rather inefficient, as you repeat the tag pattern three times in the regex. You just need to create a pattern with the so-called whitespace boundaries, (?<!\S) and (?!\S), and then you only need one tag pattern.
Next, if you have several thousand alternatives, even the single-tag-pattern regex will be extremely slow, because many alternatives can match at the same location in the string, which causes too much backtracking.
To reduce this backtracking, you will need a regex trie.
Here is a working snippet:
import regex
import pandas as pd
## Class to build a regex trie, see https://stackoverflow.com/a/42789508/3832970
class Trie():
    """Regex::Trie in Python. Creates a Trie out of a list of words. The trie can be exported to a Regex pattern.
    The corresponding Regex should match much faster than a simple Regex union."""
    def __init__(self):
        self.data = {}
    def add(self, word):
        # Walk (or create) one nested dict per character; '' marks the end of a word.
        ref = self.data
        for char in word:
            ref[char] = char in ref and ref[char] or {}
            ref = ref[char]
        ref[''] = 1
    def dump(self):
        return self.data
    def quote(self, char):
        return regex.escape(char)
    def _pattern(self, pData):
        data = pData
        if "" in data and len(data.keys()) == 1:
            return None
        alt = []
        cc = []
        q = 0
        for char in sorted(data.keys()):
            if isinstance(data[char], dict):
                try:
                    recurse = self._pattern(data[char])
                    alt.append(self.quote(char) + recurse)
                except TypeError:
                    # recurse is None: this character ends a word and has no continuation.
                    cc.append(self.quote(char))
            else:
                q = 1
        cconly = not len(alt) > 0
        if len(cc) > 0:
            if len(cc) == 1:
                alt.append(cc[0])
            else:
                alt.append('[' + ''.join(cc) + ']')
        if len(alt) == 1:
            result = alt[0]
        else:
            result = "(?:" + "|".join(alt) + ")"
        if q:
            if cconly:
                result += "?"
            else:
                result = "(?:%s)?" % result
        return result
    def pattern(self):
        return self._pattern(self.dump())
## Start of main code
df = pd.DataFrame()
df['NAMES'] = pd.Series(['ALEXANDERS', 'NOVA XANDER', 'SALA MANDER', 'PARIS HILTON', 'THE HARIS DOWNTOWN', 'APARISIAN', 'PARIS', 'MARIN XO']) # 1mil rows
ref_df = pd.DataFrame()
ref_df['REF_NAMES'] = pd.Series(['XANDER','PARIS']) # 10 k rows
trie = Trie()
for word in ref_df['REF_NAMES'].tolist():
    trie.add(word)
tags_ptn_regex = regex.compile(r"(?:(?<!\S)(?:{})(?!\S)){{s<=1:[A-Z]}}".format(trie.pattern()), regex.IGNORECASE)
def search_it(partyname):
    m = tags_ptn_regex.search(partyname)
    if m is not None:
        return m.group()
    else:
        return None
df['MATCH'] = df['NAMES'].apply(search_it)
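To see the whitespace boundaries in isolation, here is a small sketch that uses only the standard re module. It covers exact whole-word matching only: the fuzzy {s<=1} syntax belongs to the third-party regex package and is deliberately left out. The names boundary_rx and exact_tag are hypothetical.

```python
import re

tags = ['XANDER', 'PARIS']
# One alternation wrapped in whitespace boundaries, instead of three copies per tag.
boundary_rx = re.compile(r'(?<!\S)(?:{})(?!\S)'.format('|'.join(map(re.escape, tags))))

def exact_tag(name):
    """Return the first tag that appears as a whole word in `name`, else None."""
    m = boundary_rx.search(name)
    return m.group() if m else None

for name in ['NOVA XANDER', 'PARIS HILTON', 'ALEXANDERS', 'APARISIAN']:
    print(name, '->', exact_tag(name))
```

(?<!\S) succeeds at the start of the string or after whitespace, and (?!\S) succeeds at the end of the string or before whitespace, so the three alternatives per tag in the original pattern collapse into one.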

Text processing to get if else type condition from a string

First of all, sorry about the odd question title; I couldn't express it in one line.
The problem is this.
Given the following string:
"('James Gosling'/jamesgosling/james gosling) , ('SUN Microsystem'/sunmicrosystem), keyword"
I have to parse it as
list1 = ["'James Gosling'", 'jamesgosling', 'jame gosling']
list2 = ["'SUN Microsystem'", 'sunmicrosystem']
list3 = [ list1, list2, keyword]
So if I enter James Gosling Sun Microsystem keyword, it should tell me that my input is 100% correct.
And if I enter J Gosling Sun Microsystem keyword, it should say I am only 66.66% correct.
This is what I have tried so far:
import re
def main():
    print("starting")
    sentence = "('James Gosling'/jamesgosling/jame gosling) , ('SUN Microsystem'/sunmicrosystem), keyword"
    splited = sentence.split(",")
    number_of_primary_keywords = len(splited)
    #print(number_of_primary_keywords, "primary keywords length")
    number_of_brackets = 0
    inside_quotes = ''
    inside_quotes_1 = ''
    inside_brackets = ''
    for n in range(len(splited)):
        #print(len(re.findall('\w+', splited[n])), "length of splitted")
        inside_brackets = splited[n][splited[n].find("(") + 1: splited[n].find(")")]
        synonyms = inside_brackets.split("/")
        for x in range(len(synonyms)):
            try:
                inside_quotes_1 = synonyms[x][synonyms[x].find("\"") + 1: synonyms[n].find("\"")]
                print(inside_quotes_1)
            except:
                pass
            try:
                inside_quotes = synonyms[x][synonyms[x].find("'") + 1: synonyms[n].find("'")]
                print(inside_quotes)
            except:
                pass
            #print(synonyms[x])
        number_of_brackets += 1
    print(number_of_brackets)
if __name__ == '__main__':
    main()
Output is as follows
'James Gosling
jamesgoslin
jame goslin
'SUN Microsystem
SUN Microsystem
sunmicrosyste
sunmicrosyste
3
As you can see, the last letters of some words are missing.
If you have read this far, I hope you can help me get the expected output.
Unfortunately, your code has a logic issue that I could not pin down exactly, but it is probably in these lines:
inside_quotes_1 = synonyms[x][synonyms[x].find("\"") + 1: synonyms[n].find("\"")]
inside_quotes = synonyms[x][synonyms[x].find("'") + 1: synonyms[n].find("'")]
By the way, you can write the quote characters as hex escapes instead of backslash-escaping them:
inside_quotes_1 = synonyms[x][synonyms[x].find("\x22") + 1: synonyms[n].find("\x22")]
inside_quotes = synonyms[x][synonyms[x].find("\x27") + 1: synonyms[n].find("\x27")]
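One possible cause of the missing last letters, offered as an assumption rather than a confirmed diagnosis: str.find() returns -1 when the character is absent, and a slice that ends at -1 drops the final character. The end index in the lines above is also computed from synonyms[n] rather than synonyms[x]. The sketch below shows the -1 effect and a hypothetical strip_quotes helper that avoids the index arithmetic altogether.

```python
word = "jamesgosling"
end = word.find("'")      # no quote in the word, so find() returns -1
print(end)                # -1
print(word[0:end])        # jamesgoslin (the slice stops before the last character)

def strip_quotes(text):
    """Remove surrounding single or double quotes, if any (hypothetical helper)."""
    return text.strip("'\"")

print(strip_quotes("'James Gosling'"))  # James Gosling
print(strip_quotes("jamesgosling"))     # jamesgosling
```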
Other than that, you seem to want to extract the words along with their indices. You can extract the words with a basic expression:
(\w+)
Then find a simple way to locate the index of each word, and associate each word with the desired index.
Example Test
# -*- coding: UTF-8 -*-
import re
string = "('James Gosling'/jamesgosling/james gosling) , ('SUN Microsystem'/sunmicrosystem), keyword"
expression = r'(\w+)'
match = re.search(expression, string)
if match:
    print("YAAAY! \"" + match.group(1) + "\" is a match 💚💚💚 ")
else:
    print('🙀 Sorry! No matches! Something is not right! Call 911 👮')
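To get from the string to the percentage the question asks for, one possible approach is sketched below. It rests on several assumptions that are not in the original thread: synonyms never contain commas or slashes, matching is case-insensitive substring containment, and the helper names parse_groups and score are hypothetical.

```python
def parse_groups(spec):
    """Split a spec such as "('A'/a), keyword" into lists of lowercase synonyms."""
    groups = []
    # Each comma-separated item is either a parenthesised synonym list or a bare keyword.
    for item in spec.split(","):
        item = item.strip()
        if item.startswith("(") and item.endswith(")"):
            item = item[1:-1]
        # Drop surrounding quotes and normalise case for comparison.
        groups.append([s.strip().strip("'\"").lower() for s in item.split("/")])
    return groups

def score(spec, text):
    """Percentage of groups for which at least one synonym occurs in `text`."""
    groups = parse_groups(spec)
    text = text.lower()
    hits = sum(any(s in text for s in group) for group in groups)
    return 100.0 * hits / len(groups)

spec = "('James Gosling'/jamesgosling/james gosling) , ('SUN Microsystem'/sunmicrosystem), keyword"
print(score(spec, "James Gosling Sun Microsystem keyword"))        # 100.0
print(round(score(spec, "J Gosling Sun Microsystem keyword"), 2))  # 66.67
```

Substring containment is the simplest choice here; stricter whole-word matching would need word boundaries or tokenisation instead.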

Return first instance of capturing group if found, otherwise empty string

My inputs are strings that may or may not contain a pattern:
p = r'(\d)'
s = 'abcd3f'
I want to return the capturing group for the first match of this pattern if it is found, and an empty string otherwise.
result = re.search(p, s)[1]
Will return the first match. But if s = 'abcdef' then search will return None and the indexing will throw an exception. Instead of doing that, I'd like it to just return an empty string. I can do:
g = re.search(p, s)
result = ''
if g is not None: result = g[1]
Or even:
try:
    result = re.search(p, s)[1]
except:
    result = ''
But these both seem pretty complicated for something so simple. Is there a more elegant way of accomplishing what I want, preferably in one line?
You could test whether the match object is None and fall back to an empty string. For example:
import re
m = re.search(r'(\d)', 'ab1cdf')
result = '' if m is None else m.group(1)
print(result)
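Two compact variants, sketched for Python 3 with hypothetical function names. The second relies on the documented behaviour that re.findall returns the captured strings when the pattern has exactly one group.

```python
import re

def first_group(pattern, text):
    """Group 1 of the first match, or '' when the pattern does not match."""
    match = re.search(pattern, text)
    return match.group(1) if match else ''

def first_group_findall(pattern, text):
    """Same result via findall; assumes `pattern` has exactly one capturing group."""
    return next(iter(re.findall(pattern, text)), '')

print(repr(first_group(r'(\d)', 'abcd3f')))          # '3'
print(repr(first_group(r'(\d)', 'abcdef')))          # ''
print(repr(first_group_findall(r'(\d)', 'abcdef')))  # ''
```

Note that the findall variant scans the whole string before returning, so the search-based version is the cheaper one on long inputs.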

How to count words with one syllable in a list of strings of one word using regular expressions

I'm trying to count the number of words with one syllable in a fairly long text. A one-syllable word is defined here as zero or more consonants, followed by one or more vowels, followed by zero or more consonants.
The text has been lowercased and split into a list of single-word strings. Yet every time I try to use regular expressions to get the count, I get an error because the object is a list and not a string.
How would I do this with a list?
f = open('pg36.txt')
war = f.read()
warlow = war.lower()
warsplit = warlow.split()
import re
def syllables():
    count = len(re.findall('[bcdfghjklmnpqrstvwxyz]*[aeiou]+[bcdfghjklmnpqrstvwxyz]*', warsplit))
    return count
print (count)
syllables()
This fails because findall is being called on a list, and findall only works on a string. You could try the following instead:
import re
f = open('file')
war = f.read()
warlow = war.lower()
warsplit = warlow.split()
def syllables():
    count = 0
    # Test each word separately; ^ and $ force the whole word to fit the pattern.
    for i in warsplit:
        if re.match(r'^[bcdfghjklmnpqrstvwxyz]*[aeiou]+[bcdfghjklmnpqrstvwxyz]*$', i):
            count += 1
    return count
print(syllables())
f.close()
Alternatively, use findall directly on the warlow string:
import re
f = open('file')
war = f.read()
warlow = war.lower()
print(len(re.findall(r'(?<!\S)[bcdfghjklmnpqrstvwxyz]*[aeiou]+[bcdfghjklmnpqrstvwxyz]*(?!\S)', warlow)))
f.close()
Try this regex instead:
^[^aeiouAEIOU]*[aeiouAEIOU]+[^aeiouAEIOU]*$
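A compact Python 3 sketch of the per-word approach, using the definition from the question (with 'y' counted as a consonant). The names ONE_SYLLABLE and count_one_syllable are hypothetical, and a short word list stands in for the contents of the file.

```python
import re

# fullmatch() requires the whole word to fit the pattern, so no ^ or $ is needed.
ONE_SYLLABLE = re.compile(r'[bcdfghjklmnpqrstvwxyz]*[aeiou]+[bcdfghjklmnpqrstvwxyz]*')

def count_one_syllable(words):
    """Count the words made of consonants*, vowels+, consonants*."""
    return sum(1 for word in words if ONE_SYLLABLE.fullmatch(word))

words = ['the', 'cat', 'table', 'why', 'tree']
print(count_one_syllable(words))  # 3
```

Here 'table' is rejected because a second vowel group follows the consonants, and 'why' is rejected because it contains none of a, e, i, o, u.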

Use Python regular expression to extract special strings

Given strings like:
str = '12-1 abcd fadf adfad'
I want to get 12-1. How can I do that in Python?
I'm using the following code, but it does not work.
m = re.search('.*(\number+-\number+).*', str)
if m:
    found = m.group(0)
    print(found)
Try:
import re
str = '12-1 abcd fadf adfad'
# \d matches a digit; the raw string keeps the backslashes intact.
m = re.search(r'(\d+-\d+)', str)
if m:
    found = m.group(0)
    print(found)
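The same extraction wrapped in a small Python 3 function, as a sketch; the name extract_range is hypothetical, and the variable is called text to avoid shadowing the built-in str.

```python
import re

def extract_range(text):
    """Return the first 'digits-digits' token in `text`, or None if there is none."""
    m = re.search(r'\d+-\d+', text)
    return m.group(0) if m else None

print(extract_range('12-1 abcd fadf adfad'))  # 12-1
```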