Python regular expression with unicode - regex

I am using Python 2.7 (I cannot use 3.4). My code is:
# -*- coding: utf-8 -*-
import re
import sys

text = """
saú$_ß$¤×÷asd县阴őasdCharacters: \"县阴 asdsadsasd县阴
"""
text = unicode(text, "utf-8")

print("Method 1\n")
reg = "Characters: \"[\u4e00-\u9fff]+.*?"
reg = unicode(reg, "utf-8")
pattern = re.compile(reg, re.UNICODE | re.MULTILINE)
for m in re.findall(pattern, text):
    print("Results: %s" % m.encode(sys.stdout.encoding, errors='replace'))

print("Method 2\n")
reg = u"Characters: \"[\u4e00-\u9fff]+.*?"
pattern = re.compile(reg, re.UNICODE | re.MULTILINE)
for m in re.findall(pattern, text):
    print("Results: %s" % m.encode(sys.stdout.encoding, errors='replace'))
The output is:
Method 1
Method 2
Results: Characters: "??
The question is: how can I get the Method 2 result when the pattern is built from a plain (byte) string variable, as in Method 1? I haven't found a solution yet, and I don't understand why Method 1 doesn't work.
Thanks for any suggestions.

Method 1 does not work because a \u#### escape means nothing inside a byte string: the pattern literally contains the characters backslash, u, 4, e, 0, 0, and decoding it as UTF-8 does not turn them into the code point U+4E00. Instead, you need the actual UTF-8 byte sequence for the characters, which then decodes to the range you want. If you do this, Method 1 produces the same results as Method 2. I modified your code as follows:
# -*- coding: utf-8 -*-
import sys
import re

text = """
saú$_ß$¤×÷asd县阴őasdCharacters: \"县阴 asdsadsasd县阴
"""
text = unicode(text, "utf-8")

print("\nMethod 1\n")
reg = "Characters: \"[\xe4\xb8\x80-\xe9\xbf\xbf]+.*?"
reg = unicode(reg, "utf-8")
pattern = re.compile(reg, re.UNICODE | re.MULTILINE)
for m in re.findall(pattern, text):
    print("Results: %s" % m.encode(sys.stdout.encoding, errors='replace'))

print("\nMethod 2\n")
reg = u"Characters: \"[\u4e00-\u9fff]+.*?"
pattern = re.compile(reg, re.UNICODE | re.MULTILINE)
for m in re.findall(pattern, text):
    print("Results: %s" % m.encode(sys.stdout.encoding, errors='replace'))
It produces the following results on my machine:
Method 1
Results: Characters: "县阴
Method 2
Results: Characters: "县阴
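To make the difference concrete, here is a small illustration I am adding (it is not part of the answer's code): in Python 2, a plain string literal keeps "\u4e00" as six separate characters, while a unicode literal stores the single code point.

# Illustration only (Python 2).
b = "\u4e00"            # byte string: backslash, 'u', '4', 'e', '0', '0'
u = u"\u4e00"           # unicode string: the single character U+4E00
print(len(b), len(u))            # (6, 1)
print(b.decode("utf-8") == u)    # False: decoding does not interpret the \u escape
print("\xe4\xb8\x80".decode("utf-8") == u)   # True: these are the UTF-8 bytes of U+4E00

This is why the byte-level character class [\xe4\xb8\x80-\xe9\xbf\xbf] works after decoding, while [\u4e00-\u9fff] in a plain byte string does not.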

Related

Trouble sorting a list after using regex

The code below is parsing data from this text sample:
rf-Parameters-v1020
supportedBandCombination-r10: 128 items
Item 0
BandCombinationParameters-r10: 1 item
Item 0
BandParameters-r10
bandEUTRA-r10: 2
bandParametersUL-r10: 1 item
Item 0
CA-MIMO-ParametersUL-r10
ca-BandwidthClassUL-r10: a (0)
bandParametersDL-r10: 1 item
Item 0
CA-MIMO-ParametersDL-r10
ca-BandwidthClassDL-r10: a (0)
supportedMIMO-CapabilityDL-r10: fourLayers (1)
I am having trouble replacing the first 'a' from the "ca-BandwidthClassUL-r10" line with 'u' and placing it before 'm' in the final output: [2 a(0) u m]
import re
regex = r"bandEUTRA-r10: *(\d+)(?:\r?\n(?!ca-BandwidthClassUL-r10:).*)*\r?\nca-BandwidthClassUL-r10*: *(\w.*)(" \
r"?:\r?\n(?!ca-BandwidthClassDL-r10:).*)*\r?\nca-BandwidthClassDL-r10*: *(" \
r"\w.*)\nsupportedMIMO-CapabilityDL-r10: *(.*) "
regex2 = r"^.*bandEUTRA-r10: *(\d+)(?:\r?\n(?!ca-BandwidthClassUL-r10:).*)*\r?\nca-BandwidthClassUL-r10*: *(\w.*)(?:\r?\n(?!ca-BandwidthClassDL-r10:).*)*\r?\nca-BandwidthClassDL-r10*: *(\w.*)\nsupportedMIMO-CapabilityDL-r10: *(.*)(?:\r?\n(?!bandEUTRA-r10:).*)*\r?\nbandEUTRA-r10: *(\d+)(?:\r?\n(?!ca-BandwidthClassDL-r10:).*)*\r?\nca-BandwidthClassDL-r10*: *(\w.*)\nsupportedMIMO-CapabilityDL-r10: *(.*)"
my_file = open("files.txt", "r")
content = my_file.read().replace("fourLayers", 'm').replace("twoLayers", " ")
#print(content)
#if 'BandCombinationParameters-r10: 1 item' in content:
result = ["".join(m) for m in re.findall(regex, content, re.MULTILINE)]
print(result)
You might make the part that captures group 2 optional.
Then you can print group 3 concatenated with "u" if group 2 matched, otherwise print only group 3.
As you are already matching the text in the regex, you don't have to do the separate replacement calls. You can use the text in the replacement itself.
bandEUTRA-r10: *(\d+)(?:\r?\n(?!ca-BandwidthClassUL-r10:).*)*(?:\r?\n(ca-BandwidthClassUL-r10)?: *(\w.*))(?:\r?\n(?!ca-BandwidthClassDL-r10:).*)*\r?\nca-BandwidthClassDL-r10*: *\w.*\nsupportedMIMO-CapabilityDL-r10:
For example
import re

regex = r"bandEUTRA-r10: *(\d+)(?:\r?\n(?!ca-BandwidthClassUL-r10:).*)*(?:\r?\n(ca-BandwidthClassUL-r10)?: *(\w.*))(?:\r?\n(?!ca-BandwidthClassDL-r10:).*)*\r?\nca-BandwidthClassDL-r10*: *\w.*\nsupportedMIMO-CapabilityDL-r10:"
s = "here the example data with and without ca-BandwidthClassUL-r10"
matches = re.finditer(regex, s, re.MULTILINE)
for matchNum, match in enumerate(matches, start=1):
    result = "{0}{1} m".format(
        match.group(1),
        match.group(3) + " u" if match.group(2) else match.group(3)
    )
    print(result)
Output
2a (0) u m
2a (0) m

When using pandas is it possible to replace the re package with the regex package? [duplicate]

I am trying to check for fuzzy matches between a string column and a reference list. The string series contains over 1 million rows and the reference list contains over 10,000 entries.
For eg:
df['NAMES'] = pd.Series(['ALEXANDERS', 'NOVA XANDER', 'SALA MANDER', 'PARIS HILTON', 'THE HARIS DOWNTOWN', 'APARISIAN', 'PARIS', 'MARIN XO']) # 1mil rows
ref_df['REF_NAMES'] = pd.Series(['XANDER','PARIS']) #10 k rows
###Output should look like
df['MATCH'] = pd.Series([Nan, 'XANDER', 'MANDER', 'PARIS', 'HARIS', Nan, 'PARIS', Nan])
A match should be generated if the word appears as a separate token in the string (and, within that token, up to one character substitution is allowed).
For example, 'PARIS' can match against 'PARIS HILTON' and 'THE HARIS DOWNTOWN', but not against 'APARISIAN'.
Similarly, 'XANDER' matches against 'NOVA XANDER' and 'SALA MANDER' (MANDER being one character away from XANDER), but does not generate a match against 'ALEXANDERS'.
We have written logic for this (shown below), but the matching takes over 4 hours to run; it needs to get under 30 minutes.
Current code -
import regex

tags_regex = ref_df['REF_NAMES'].tolist()
tags_ptn_regex = '|'.join([f'\s+{tag}\s+|^{tag}\s+|\s+{tag}$' for tag in tags_regex])

def search_it(partyname):
    m = regex.search("(" + tags_ptn_regex + ")" + "{s<=1:[A-Z]}", partyname)
    if m is not None:
        return m.group()
    else:
        return None

df['MATCH'] = df['NAMES'].apply(search_it)
Also, will multiprocessing help with the regex part? Many thanks in advance!
Your pattern is rather inefficient, as it repeats each tag three times in the regex. You just need to build the pattern with the so-called whitespace boundaries, (?<!\S) and (?!\S), and then each tag pattern is needed only once.
Next, if you have several thousand alternatives, even the single-occurrence tag pattern will be extremely slow, because many alternatives can start matching at the same location in the string, and that leads to excessive backtracking.
To reduce this backtracking, you will need a regex trie.
Here is a working snippet:
import regex
import pandas as pd

## Class to build a regex trie, see https://stackoverflow.com/a/42789508/3832970
class Trie():
    """Regex::Trie in Python. Creates a Trie out of a list of words. The trie can be exported to a Regex pattern.
    The corresponding Regex should match much faster than a simple Regex union."""

    def __init__(self):
        self.data = {}

    def add(self, word):
        ref = self.data
        for char in word:
            ref[char] = char in ref and ref[char] or {}
            ref = ref[char]
        ref[''] = 1

    def dump(self):
        return self.data

    def quote(self, char):
        return regex.escape(char)

    def _pattern(self, pData):
        data = pData
        if "" in data and len(data.keys()) == 1:
            return None
        alt = []
        cc = []
        q = 0
        for char in sorted(data.keys()):
            if isinstance(data[char], dict):
                try:
                    recurse = self._pattern(data[char])
                    alt.append(self.quote(char) + recurse)
                except:
                    cc.append(self.quote(char))
            else:
                q = 1
        cconly = not len(alt) > 0
        if len(cc) > 0:
            if len(cc) == 1:
                alt.append(cc[0])
            else:
                alt.append('[' + ''.join(cc) + ']')
        if len(alt) == 1:
            result = alt[0]
        else:
            result = "(?:" + "|".join(alt) + ")"
        if q:
            if cconly:
                result += "?"
            else:
                result = "(?:%s)?" % result
        return result

    def pattern(self):
        return self._pattern(self.dump())

## Start of main code
df = pd.DataFrame()
df['NAMES'] = pd.Series(['ALEXANDERS', 'NOVA XANDER', 'SALA MANDER', 'PARIS HILTON', 'THE HARIS DOWNTOWN', 'APARISIAN', 'PARIS', 'MARIN XO']) # 1mil rows
ref_df = pd.DataFrame()
ref_df['REF_NAMES'] = pd.Series(['XANDER','PARIS']) #10 k rows

trie = Trie()
for word in ref_df['REF_NAMES'].tolist():
    trie.add(word)

tags_ptn_regex = regex.compile(r"(?:(?<!\S)(?:{})(?!\S)){{s<=1:[A-Z]}}".format(trie.pattern()), regex.IGNORECASE)

def search_it(partyname):
    m = tags_ptn_regex.search(partyname)
    if m is not None:
        return m.group()
    else:
        return None

df['MATCH'] = df['NAMES'].apply(search_it)
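As a quick sanity check (not part of the original answer, just an illustration I am adding), the compiled pattern can be tried on a couple of the sample names to confirm that the whitespace boundaries and the one-substitution fuzziness behave as described:

# Illustration only: exercising tags_ptn_regex built above on two sample strings.
print(tags_ptn_regex.search("SALA MANDER"))   # matches "MANDER", one substitution away from XANDER
print(tags_ptn_regex.search("ALEXANDERS"))    # None: XANDER is not a whitespace-delimited token here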

How to limit the consecutive occurrences of elements in string using python regular expression?

I have a string with repeated characters. If I want to limit the repetition, what should my pattern be?
e.g.
Suppose my string is "aaaaajkefhejkffdddddrhigjlkglhhhh".
I want the output aajkefhejkffddrhigjlkglhh. More than 4 consecutive repetitions should be replaced by two occurrences.
I tried the pattern below:
import re

str1 = "aaaaajkefhejkffdddddrhigjlkglhhhh"
str1 = re.sub(r'(\w)\1+', r'\1{2}', str1)
print(str1)
I expected the output "aajkefhejkffddrhigjlkglhh", but the actual output is "a{2}jkefhejkf{2}d{2}rhigjlkglh{2}".
Try:
import re
str1="aaaaajkefhejkffdddddrhigjlkglhhhhzzz"
str1=re.sub(r'(.)\1{3,}', r"\1\1",str1)
print(str1)
Output:
aajkefhejkffddrhigjlkglhhzzz
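If you need a different run-length threshold, the same idea can be parameterized. The helper below is just a sketch I am adding (its name and arguments are made up, not part of the answer):

import re

def cap_runs(s, threshold=4, keep=2):
    # Collapse any run of `threshold` or more identical characters down to `keep` occurrences.
    return re.sub(r'(.)\1{%d,}' % (threshold - 1), r'\1' * keep, s)

print(cap_runs("aaaaajkefhejkffdddddrhigjlkglhhhhzzz"))  # aajkefhejkffddrhigjlkglhhzzz

With threshold=4 it reproduces the output above; with threshold=3 the zzz run would be collapsed as well.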
An alternative without regex scans the string and drops a character whenever two adjacent windows of the chosen length are identical:
input = "aaaaajkefhejkffdddddrhigjlkglhhhh"

def duplicate_character_shortener(input, length):
    index = 0
    while index < len(input) - length + 1:
        if input[index:index+length] == input[index+1:index+length+1]:
            input = input[:index] + input[index+1:]
        else:
            index += 1
    return input

output = duplicate_character_shortener(input, 2)
print(output)
>>> aajkefhejkffddrhigjlkglhh

Text processing to get if else type condition from a string

First of all, I am sorry about the weird question heading. Couldn't express it in one line.
So, the problem statement is,
If I am given the following string --
"('James Gosling'/jamesgosling/james gosling) , ('SUN Microsystem'/sunmicrosystem), keyword"
I have to parse it as
list1 = ["'James Gosling'", 'jamesgosling', 'jame gosling']
list2 = ["'SUN Microsystem'", 'sunmicrosystem']
list3 = [ list1, list2, keyword]
So that, if I enter James Gosling Sun Microsystem keyword, it should tell me that what I have entered is 100% correct.
And if I enter J Gosling Sun Microsystem keyword, it should say I am only 66.66% correct.
This is what I have tried so far.
import re

def main():
    print("starting")
    sentence = "('James Gosling'/jamesgosling/jame gosling) , ('SUN Microsystem'/sunmicrosystem), keyword"
    splited = sentence.split(",")
    number_of_primary_keywords = len(splited)
    #print(number_of_primary_keywords, "primary keywords length")
    number_of_brackets = 0
    inside_quotes = ''
    inside_quotes_1 = ''
    inside_brackets = ''
    for n in range(len(splited)):
        #print(len(re.findall('\w+', splited[n])), "length of splitted")
        inside_brackets = splited[n][splited[n].find("(") + 1: splited[n].find(")")]
        synonyms = inside_brackets.split("/")
        for x in range(len(synonyms)):
            try:
                inside_quotes_1 = synonyms[x][synonyms[x].find("\"") + 1: synonyms[n].find("\"")]
                print(inside_quotes_1)
            except:
                pass
            try:
                inside_quotes = synonyms[x][synonyms[x].find("'") + 1: synonyms[n].find("'")]
                print(inside_quotes)
            except:
                pass
            #print(synonyms[x])
        number_of_brackets += 1
    print(number_of_brackets)

if __name__ == '__main__':
    main()
Output is as follows
'James Gosling
jamesgoslin
jame goslin
'SUN Microsystem
SUN Microsystem
sunmicrosyste
sunmicrosyste
3
As you can see, the last letters of some words are missing.
So, if you read this far, I hope you can help me in getting the expected output
Unfortunately, your code has a logic issue that I could not quite pin down; however, it is probably in these lines:
inside_quotes_1 = synonyms[x][synonyms[x].find("\"") + 1: synonyms[n].find("\"")]
inside_quotes = synonyms[x][synonyms[x].find("'") + 1: synonyms[n].find("'")]
which by the way you can simply use:
inside_quotes_1 = synonyms[x][synonyms[x].find("\x22") + 1: synonyms[n].find("\x22")]
inside_quotes = synonyms[x][synonyms[x].find("\x27") + 1: synonyms[n].find("\x27")]
Other than that, you seem to want to extract the words with their indices, which you can do with a basic expression:
(\w+)
Then, you might want to find a simple way to locate the indices, where the words are. Then, associate each word to the desired indices.
Example Test
# -*- coding: UTF-8 -*-
import re

string = "('James Gosling'/jamesgosling/james gosling) , ('SUN Microsystem'/sunmicrosystem), keyword"
expression = r'(\w+)'
match = re.search(expression, string)

if match:
    print("YAAAY! \"" + match.group(1) + "\" is a match 💚💚💚 ")
else:
    print('🙀 Sorry! No matches! Something is not right! Call 911 👮')
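Going a bit beyond the answer above, here is a hedged sketch of how the nested lists from the question could be built with split and re.findall. It assumes the input always looks like the sample string (top-level commas between groups, synonyms separated by slashes):

import re

string = "('James Gosling'/jamesgosling/james gosling) , ('SUN Microsystem'/sunmicrosystem), keyword"

# Split on the top-level commas, strip the surrounding parentheses and spaces,
# then pull out either a quoted phrase or a run of word characters/spaces.
chunks = [part.strip(" ()") for part in string.split(",")]
lists = [re.findall(r"'[^']*'|[\w ]+", chunk) for chunk in chunks]
print(lists)
# [["'James Gosling'", 'jamesgosling', 'james gosling'],
#  ["'SUN Microsystem'", 'sunmicrosystem'], ['keyword']]   (wrapped here for readability)

From there, scoring an entered phrase against each sub-list would give the percentages the question asks about.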

Notepad++: Replace find query with words from list

I would like to replace all "var_"
var_
Hello
var_
Whats
var_
Up?
...
with words from this list
alpha
beta
gamma
...
so the end result is
alpha
Hello
beta
Whats
gamma
Up?
...
Would appreciate help on achieving this!
This is essentially impossible, or at least overly complicated, with a regex alone. However, if you combine it with a programming language, you can get it done quickly. In Python, for example, it could look like this:
import sys
import re
import fileinput

if len(sys.argv) < 3:
    exit("Usage: " + sys.argv[0] + " <filename> <replacements>")

input_file = sys.argv[1]
replacements = sys.argv[2:]
num_of_replacements = len(replacements)
replacement_index = 0

searcher = re.compile("^var_\\b")
for line in fileinput.input(input_file, inplace=True, backup='.bak'):
    match = searcher.match(line)
    if match is None:
        print(line.rstrip())
    else:
        print(re.sub("^var_\\b", replacements[replacement_index], line.rstrip()))
        replacement_index = replacement_index + 1
Usage: replacer.py ExampleInput.txt alpha beta gamma
Update
It's possible to modify the program to accept the string you search for as the 1st param:
replacer.py "var_" ExampleInput.txt alpha beta gamma
The modified python script looks like this:
import sys
import re
import fileinput

if len(sys.argv) < 4:
    exit("Usage: " + sys.argv[0] + " <pattern> <filename> <replacements>")

search = "\\b" + sys.argv[1] + "\\b"
input_file = sys.argv[2]
replacements = sys.argv[3:]
num_of_replacements = len(replacements)
replacement_index = 0

searcher = re.compile(search)
for line in fileinput.input(input_file, inplace=True, backup='.bak'):
    match = searcher.match(line)
    if match is None:
        print(line.rstrip())
    else:
        print(re.sub(search, replacements[replacement_index], line.rstrip()))
        replacement_index = replacement_index + 1
Note: this script still has a few limitations:
it expects that the string you search for occurs only once each line.
it replaces the searched string only if it's a distinct word
you can accidentally incorporate any python regex syntax into the search param
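If the replacement words live in a text file rather than on the command line (the question talks about a list of words), a variation of the same approach could read them first. This is only a sketch; the file names words.txt and ExampleInput.txt are assumptions, not something from the question:

import re
import fileinput

# Assumed file layout: words.txt holds one replacement word per line,
# ExampleInput.txt is the file containing the var_ placeholders.
with open("words.txt") as f:
    replacements = [w.strip() for w in f if w.strip()]

replacement_index = 0
searcher = re.compile(r"\bvar_\b")
for line in fileinput.input("ExampleInput.txt", inplace=True, backup='.bak'):
    if searcher.search(line) and replacement_index < len(replacements):
        print(searcher.sub(replacements[replacement_index], line.rstrip()))
        replacement_index += 1
    else:
        print(line.rstrip())

The same limitations listed above still apply.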